[PATCH] Smaller PMCs - 2nd try


Leopold Toetsch

May 24, 2003, 6:04:10 AM
to P6I
Ok, here we go, smaller PMCs second edition.

- PMC is 8 bytes smaller (- 3 pointers + 1 *pmc_ext)
- PMC_EXT has metadata, synchronize, next_for_GC
- the PMC_EXT structure gets allocated for all but PerlScalars
(this could be optimized, allocate it for aggregates only)
- it is kept in an unmanaged small object pool
- reimplemented the skip-logic per pool
(if a DOD run doesn't yield at least replenish_level free objects,
the next DOD run is skipped if this pool needs more objects)
- when properties get attached to a PerlScalar, the extended structure
gets appended.
- passes all tests: parrot, imcc, perl6 except t/compiler/1_14
(because of inf/Inf spelling)
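For concreteness, here's roughly what the split looks like in C. The PMC_EXT field names follow the list above; the remaining PMC fields are illustrative, not the actual Parrot definitions:

```c
#include <stddef.h>

/* Rarely-used fields moved out of the PMC body; PMC_EXT is allocated
 * lazily, e.g. only when properties get attached to a PerlScalar. */
typedef struct PMC_EXT {
    struct PMC *metadata;      /* properties */
    void       *synchronize;   /* threading/synchronization info */
    struct PMC *next_for_GC;   /* DOD traversal chain */
} PMC_EXT;

/* The slimmed-down PMC: three inline pointers replaced by one,
 * saving 8 bytes per PMC on a 32-bit machine. */
typedef struct PMC {
    unsigned long flags;
    void    *vtable;
    void    *data;
    PMC_EXT *pmc_ext;          /* NULL for plain PerlScalars */
} PMC;
```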

Some timings for examples/benchmarks/stress.pasm (unoptimized build, -j)

Original: 1.20s
PMC_EXT: 0.95s
skip: 0.80s

The patch contains Matt Fowles' list and imcc changes too.

Next step could be to move data to PMC_EXT, though this needs some
macros and renaming.

Comments welcome.
leo

spmc-2.patch

Nicholas Clark

May 24, 2003, 6:33:35 AM
to Leopold Toetsch, P6I
On Sat, May 24, 2003 at 12:04:10PM +0200, Leopold Toetsch wrote:
> Some timings for examples/benchmarks/stress.pasm (unoptimized build, -j)
>
> Original: 1.20s
> PMC_EXT: 0.95s
> skip: 0.80s

Possibly stupid question - do you know if using an optimised build with -j
makes much difference? It seems strange posting benchmarking results without
using the optimiser, but then again I've no idea how much of stress.pasm
sits inside the hand crafted JIT code.

Nicholas Clark

Leopold Toetsch

May 24, 2003, 6:52:07 AM
to Nicholas Clark, P6I
Nicholas Clark wrote:

> On Sat, May 24, 2003 at 12:04:10PM +0200, Leopold Toetsch wrote:
>
>>Some timings for examples/benchmarks/stress.pasm (unoptimized build, -j)
>>
>>Original: 1.20s
>>PMC_EXT: 0.95s
>>skip: 0.80s
>>
>
> Possibly stupid question - do you know if using an optimised build with -j
> makes much difference?


Yep, 0.6s vs 0.8s. stress.pasm contains PMC access; JIT calls the
vtable methods, which benefit from -O3. PMC allocation and freeing
(new_pmc ... -> get_free_object and add_free_object) are also a lot faster.

> It seems strange posting benchmarking results without
> using the optimiser, but then again I've no idea how much of stress.pasm
> sits inside the hand crafted JIT code.


I just wanted to show the relative difference. But here are some -O3
timings with the JIT runtime:

Original: 1.0s (stress.pasm)
spmc-2: 0.6s

Original: 720/s (life.pasm)
spmc-2: 793/s

stress2.pasm seems to be slightly slower now (1.5s vs. 1.44s before) -
there is more aggregate access in this benchmark.


> Nicholas Clark

leo

Leopold Toetsch

May 27, 2003, 2:03:36 PM
to perl6-i...@perl.org
Some additional remarks:

The "memset" in smallobject.c is not necessary on Linux. mmap() (which
obviously gets called for memalign - at least for this arena size)
does clear the memory. We need some tests to determine from which size
on memory comes back cleared for malloc and memalign.
I tossed the memset for now and saved ~450,000 L2 misses, or ~0.2s.

I did some optimizations in list.c to avoid generating sparse lists:
when it's clear that the whole list will be filled, the
programmer/compiler should insert a

    set P0, I0    # set size of list

before setting the first element. The rules for which grow_type is
chosen when have been straightened out and are better documented (in list.c).
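The same idea in C terms: reserving the full capacity once is cheaper than letting every store grow the storage. A toy sketch (list_reserve plays the role of `set P0, I0`; all names here are illustrative, not the list.c API):

```c
#include <stdlib.h>

/* Toy growable int array. Without an up-front reserve, every
 * out-of-bounds store triggers a realloc (and, in Parrot's list.c,
 * possibly a sparse chunk); with one, storage is dense and is
 * allocated exactly once. */
typedef struct {
    int    *data;
    size_t  size, capacity;
} IntList;

static int list_reserve(IntList *l, size_t n)    /* like "set P0, I0" */
{
    if (n > l->capacity) {
        int *d = realloc(l->data, n * sizeof *d);
        if (!d)
            return -1;
        l->data = d;
        l->capacity = n;
    }
    return 0;
}

static int list_set(IntList *l, size_t i, int v)
{
    if (i >= l->capacity && list_reserve(l, i + 1) != 0)
        return -1;                               /* growth path */
    l->data[i] = v;
    if (i >= l->size)
        l->size = i + 1;
    return 0;
}
```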

So with these two refinements, I have new numbers for some stress tests:

            stress  stress1  stress2  life
CVS         1.00    -        1.44     721
SPMC        0.60    12.0     1.50     793
my current  0.33    8.8      1.24     800
perl 5.8.0  0.6     12.0     2.41     -

stress1 does 10 times the (10+20+20) allocations of 200,000 elements.
I'll check it in soon.

The stress tests are in seconds; the life test is in generations/sec.
-O3-compiled parrot, JIT runtime (-P isn't slower here), i386/linux.

SPMC ... the patch with smaller (-8 bytes) PMC
current ... additionally DOD flags moved to arena + above

Have fun,
leo

Acknowledgments: valgrind and ccache are really great tools. Get them
if you don't already have them.

Dan Sugalski

May 27, 2003, 2:33:07 PM
to l...@toetsch.at, perl6-i...@perl.org
At 8:03 PM +0200 5/27/03, Leopold Toetsch wrote:
>Some additional remarks:
>
>The "memset" in smallobject.c is not necessary on Linux. mmap() (which
>obviously gets called for memalign - at least for this arena size)
>does clear the memory. We need some tests, from which size memory is
>cleard for malloc and memalign.
>I tossed the memset for now and saved ~450.000 L2-misses or ~0.2 s.

While I didn't see any memsets in smallobject.c, I'm really, *really*
uncomfortable counting on implied behavior. There's no reason that
mmap has to return zeroed memory, and none of the man pages I have
claim that it does. While it *probably* does, it certainly doesn't
have to, and I'd definitely not count that it does, nor that it's
actually called implicitly. I've been burned by stuff like that
before.

Having said that, if we can actually guarantee certain behaviours,
I'm all for conditional code to exploit them.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Uri Guttman

May 27, 2003, 3:10:23 PM
to Dan Sugalski, l...@toetsch.at, perl6-i...@perl.org
>>>>> "DS" == Dan Sugalski <d...@sidhe.org> writes:

>> obviously gets called for memalign - at least for this arena size)
>> does clear the memory. We need some tests, from which size memory is
>> cleard for malloc and memalign.
>> I tossed the memset for now and saved ~450.000 L2-misses or ~0.2 s.

DS> While I didn't see any memsets in smallobject.c, I'm really, *really*
DS> uncomfortable counting on implied behavior. There's no reason that
DS> mmap has to return zeroed memory, and none of the man pages I have
DS> claim that it does. While it *probably* does, it certainly doesn't
DS> have to, and I'd definitely not count that it does, nor that it's
DS> actually called implicitly. I've been burned by stuff like that before.

you can mmap chunks from /dev/zero and get the behavior you want. it
shouldn't do any L1 stuff until you reference the actual pages and then
they will be zero filled for you on demand. i also don't recall mmap
ever guaranteeing any level of cleanliness.

solaris says this about /dev/zero:

Mapping a zero special file creates a zero-initialized
unnamed memory object of a length equal to the length of the
mapping and rounded up to the nearest page size as returned
by sysconf.

linux says nothing about mmap on /dev/zero in either man page.
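A minimal sketch of the /dev/zero approach described above (Unix-specific; on systems without /dev/zero, MAP_ANONYMOUS is the usual substitute):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes from /dev/zero: the kernel hands back demand-zeroed
 * pages, so no explicit memset is needed and untouched pages cost
 * nothing. (MAP_ANONYMOUS is the newer spelling of the same idea.) */
static void *map_zeroed(size_t len)
{
    void *p;
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : p;
}
```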

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Dan Sugalski

May 27, 2003, 4:13:11 PM
to Uri Guttman, l...@toetsch.at, perl6-i...@perl.org
At 3:10 PM -0400 5/27/03, Uri Guttman wrote:
> >>>>> "DS" == Dan Sugalski <d...@sidhe.org> writes:
>
> >> obviously gets called for memalign - at least for this arena size)
> >> does clear the memory. We need some tests, from which size memory is
> >> cleard for malloc and memalign.
> >> I tossed the memset for now and saved ~450.000 L2-misses or ~0.2 s.
>
> DS> While I didn't see any memsets in smallobject.c, I'm really, *really*
> DS> uncomfortable counting on implied behavior. There's no reason that
> DS> mmap has to return zeroed memory, and none of the man pages I have
> DS> claim that it does. While it *probably* does, it certainly doesn't
> DS> have to, and I'd definitely not count that it does, nor that it's
> DS> actually called implicitly. I've been burned by stuff like that before.
>
>you can mmap chunks from /dev/zero and get the behavior you want. it
>shouldn't do any L1 stuff until you reference the actual pages and then
>they will be zero filled for you on demand. i also don't recall mmap
>ever guaranteeing any level of cleanliness.

It seems that what we really ought to do is make zero-page memory
allocation system-specific, or just rely on calloc and hope that the
underlying system libraries do it as efficiently as they can.

Leopold Toetsch

May 27, 2003, 5:00:11 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:
> At 8:03 PM +0200 5/27/03, Leopold Toetsch wrote:
>>... . We need some tests, from which size memory is

>>cleard for malloc and memalign.

>>I tossed the memset for now and saved ~450.000 L2-misses or ~0.2 s.

> While I didn't see any memsets in smallobject.c, I'm really, *really*
> uncomfortable counting on implied behavior.

First, I did write the above sentence - "We need some tests...".
Second, in malloc.c (UTSL) it's clearly stated that mmap() does yield
zeroed memory. So for Linux and this size of allocation it's safe to
assume that the memset is not necessary. That's it. Lea's malloc (and
glibc's) makes the same assumption.

The memset() is in #22337.

Dan Sugalski <d...@sidhe.org> wrote:
> It seems that what we really ought to do is make zero-page memory
> allocation system specific, or jsut rely on calloc and hope that the
> underlying system libraries do that as efficiently as they can.

We currently use both malloc and calloc. Some of the smaller malloced
items are cleared, with low impact. But what we really need (for my
patch) is zeroed and aligned (rather big) memory, and we need it fast.

The current (CVS) does calloc

#define POOL_MAX_BYTES 65536*128

for the biggest pools (used in the stress tests). Actually this should
read arena and ARENA_MAX_BYTES; BTW, it's a per pool->arena size limit.

This memory is already cleared by the operating system. I did submit
(some months ago) a test program showing from which size on memory is
returned cleared.

The problem is not "counting on implied behaviour". The problem is: can
we have a memalign() for a specific size that yields zeroed memory,
where size is rather big - in regions where, AFAIK, all calloc()
implementations return zeroed memory.

If calloc() used memset() to clear the memory, you could forget about
performance.

$ man memalign
No manual entry for memalign

:-((

leo

Leopold Toetsch

May 28, 2003, 11:43:00 AM
to l...@toetsch.at, perl6-i...@perl.org
Leopold Toetsch <l...@toetsch.at> wrote:
>>>... . We need some tests, from which size memory is
>>>cleard for malloc and memalign.

Here is a small program, which could be put into a test.

Are there systems out there without memalign, where malloc.c cannot be
linked in?

/*
* test clean memory threshold
*/

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

int main(int argc, char *argv[])
{
    char *buf;
    size_t size, j;
    int u;

    if (argc != 3) {
        printf("usage: test [malloc|memalign] hexsize\n");
        return 1;
    }
    ++argv;
    size = (size_t)strtoul(argv[1], NULL, 16);

    /* dirty the heap first, so recycled memory isn't accidentally clean */
    buf = malloc(size);
    for (j = 0; j < size; j++)
        buf[j] = 0xff;
    free(buf);

    if (strcmp(*argv, "malloc") == 0) {
        printf("malloc\nsize\tclean\tadr\n");
        buf = malloc(size);
    }
    else if (strcmp(*argv, "memalign") == 0) {
        printf("memalign\nsize\tclean\tadr\n");
        if (size & (size - 1)) {
            printf("not a power of 2\n");
            return 1;
        }
        buf = memalign(size, size);
    }
    else {
        printf("usage: test [malloc|memalign] hexsize\n");
        return 1;
    }

    for (j = 0, u = 0; j < size; j++)
        if (buf[j]) {
            u = 1;
            break;
        }

    printf("0x%lx\t%s\t%p\n",
           (unsigned long)size, u ? "n" : "y", (void *)buf);
    free(buf);
    return 0;
}
/*
* Local variables:
* c-indentation-style: bsd
* c-basic-offset: 4
* indent-tabs-mode: nil
* End:
*
* vim: expandtab shiftwidth=4:
*/

leo

Leopold Toetsch

May 29, 2003, 7:48:43 AM
to perl6-i...@perl.org, Dan Sugalski
Appended is a refined version of #22337, which it obsoletes.

Key features are:
1) DOD flags (live, on_free_list, ...) are kept in the arenas, one nibble
per object
2) arena memory is acquired via memalign() to be able to calculate the
arena from an object address
3) the free_list is per arena now
4) PMC size is 20 bytes (i386, GC_DEBUG turned off) for ultimate speed

1) to 3) can be toggled in include/parrot/pobj.h with
#define ARENA_DOD_FLAGS 1

Turning this define on currently has rather low impact on performance on
my system; only stress3.pasm seems to indicate that it is the Right
Thing (IMHO).

With ARENA_DOD_FLAGS turned on, objects are not touched in the fast
paths of DOD. Only objects that get destroyed are pulled into the cache
- but our free_list handling does this anyway. This reduces cache misses
by a fair amount.
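Point 2) works because, when every arena is allocated with memalign(ARENA_SIZE, ARENA_SIZE), the arena holding any object can be recovered by masking off the low bits of the object's address - no per-object back-pointer is needed. A sketch with an illustrative ARENA_SIZE and layout (not the actual pobj.h definitions):

```c
#include <stdint.h>
#include <stdlib.h>

#define ARENA_SIZE 65536               /* must be a power of two */

/* Illustrative arena header: 1 nibble of DOD flags per 16-byte
 * object slot (65536/16 = 4096 nibbles = 2048 bytes), followed in
 * the real layout by the objects themselves. */
typedef struct Arena {
    unsigned char dod_flags[ARENA_SIZE / 16 / 2];
} Arena;

/* With arenas allocated via memalign(ARENA_SIZE, ARENA_SIZE), the
 * arena base of any interior object pointer is recovered by masking
 * off the low bits - no per-object back-pointer required. */
static Arena *arena_of(void *obj)
{
    return (Arena *)((uintptr_t)obj & ~(uintptr_t)(ARENA_SIZE - 1));
}
```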

Keeping the free_list per arena should give better cache coherency, as
add_free_object/get_free_object deal only with objects of one arena and
not with all objects in the whole pool. This effect should show up in
long-running programs, when the free_list(s) get more and more
scattered over arena/pool memory.

In dod.c there is also code to free unused arenas (for when a program
first allocates a lot of objects and then has a low usage count). But
this doesn't improve performance - though it should, IMHO ;-)

Calculating the location of the DOD flags is of course rather costly
compared to just touching the objects themselves. But as processors get
faster, avoiding memory accesses (and cache misses) becomes more
important, so I think this is the way to go.

Test results especially on fast machines welcome.
Comments welcome.

TIA & have fun,
leo

dod-flags-2.patch

Dan Sugalski

May 29, 2003, 8:17:52 AM
to Leopold Toetsch, perl6-i...@perl.org
At 5:43 PM +0200 5/28/03, Leopold Toetsch wrote:
>Leopold Toetsch <l...@toetsch.at> wrote:
>>>>... . We need some tests, from which size memory is
>>>>cleard for malloc and memalign.
>
>Here is a small program, which could be put into a test.
>
>Are there systems out there, without memalign, where malloc.c can not be
>linked in?

The test is reasonable, but I can still see places where it could
fail, and I'm just not comfortable with it, unfortunately.

Dan Sugalski

May 29, 2003, 8:16:07 AM
to l...@toetsch.at, perl6-i...@perl.org
At 11:00 PM +0200 5/27/03, Leopold Toetsch wrote:
>Dan Sugalski <d...@sidhe.org> wrote:
>> At 8:03 PM +0200 5/27/03, Leopold Toetsch wrote:
>>>... . We need some tests, from which size memory is
>>>cleard for malloc and memalign.
>
>>>I tossed the memset for now and saved ~450.000 L2-misses or ~0.2 s.
>
>> While I didn't see any memsets in smallobject.c, I'm really, *really*
>> uncomfortable counting on implied behavior.
>
>First, I did write above sentence - "We need some tests...".
>Second, in malloc.c (UTSL) its clearly stated that mmap() does yield
>zeroed memory. So, for linux and this size of allocation its save to
>assume that the memset is not necessary. That's it. LEA (and glibc)
>malloc have the same assumptions.

I generally wouldn't trust comments in the source for standard
libraries if they're not backed up by actual documentation. Internals
change, and I'd *really* hate to find ourselves in a position where
someone upgrades a system libc, or does an OS point release upgrade,
and all of a sudden parrot starts breaking. That wouldn't be at all
good.

If we can find documentation (not comments in source) that guarantees
that we can get zero-filled memory, that's fine and we should. If
not, we can't count on it.

>The problem is not, "counting on implied behaviour", the problem, is can
>we have a memalign() for a specific size, that yields zeroed memory,
>where size is rather big - in regions where calloc() AFAIK all return
>zeroed memory.
>
>If calloc() would use memset() to clear the memory you could forget all
>performance.

I think you'll find that calloc does use memset, or something like
it, in some cases. When it hands back memory that was on the free list,
it undoubtedly does. When it allocates new memory to satisfy the
request, it probably doesn't, just asking for zero-filled, newly faulted
memory from the system.

The problem we're going to run into is that this is all terribly
system-dependent. Our requests are large enough to reasonably grab
(and later release) new memory from the OS, rather than relying on
any sort of memory allocation free list, but every system does it
differently.

That, as much as anything, argues for an entry in platform.c to get
and return large sections of memory. I know it's reasonably doable on
a lot of platforms, just potentially differently everywhere.

>$ man memalign
>No manual entry for memalign

Yeah, it seems depressingly rare.

Andy Switala

May 29, 2003, 9:34:19 AM
to perl6-i...@perl.org
I found this online: http://unixhelp.ed.ac.uk/CGI/man-cgi?posix_memalign.
Note in particular, "For all three routines, the memory is not zeroed."
Regarding the lack of "man memalign," have you tried texinfo instead?
(There isn't a linux machine handy right now so I can't check myself.)
--Andy

Leopold Toetsch

May 29, 2003, 12:12:58 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski wrote:

[ zerofilled aligned memory ]

>
> That, as much as anything, argues for an entry in platform.c to get and
> return large sections of memory. I know it's reasonably doable on a lot
> of platforms, just potentially differently everywhere.


BTW, the zero-filled memory is not that important. When we put the
objects onto the free list or take them off it, we touch their memory
anyway. In the latter case the object will be used afterwards, so
that's the point at which to wash the memory.

More important is the proper alignment of the blocks. But if we get
zeroed memory for free, we can save some extra cycles too.


>> $ man memalign
>> No manual entry for memalign
>
>
> Yeah, it seems depressingly rare.


linux/drivers/char/mem.c and malloc.c:


#define MMAP(addr, size, prot, flags) ((dev_zero_fd < 0) ? \
(dev_zero_fd = open("/dev/zero", O_RDWR), \
mmap((addr), (size), (prot), (flags), dev_zero_fd, 0)) : \
mmap((addr), (size), (prot), (flags), dev_zero_fd, 0))

and:

/*
Standard unix mmap using /dev/zero clears memory so calloc doesn't
need to.
*/

That's it
leo

Steve Fink

May 30, 2003, 4:22:01 AM
to Leopold Toetsch, Dan Sugalski, perl6-i...@perl.org
sfink@foxglove:~/parrot/languages/perl6% man memalign
POSIX_MEMALIGN(3) Linux Programmer's Manual POSIX_MEMALIGN(3)

NAME
posix_memalign, memalign, valloc - Allocate aligned memory

SYNOPSIS
#include <stdlib.h>

int posix_memalign(void **memptr, size_t alignment, size_t size);
void *memalign(size_t boundary, size_t size);
void *valloc(size_t size);

DESCRIPTION

The function posix_memalign() allocates size bytes and places
the address of the allocated memory in *memptr. The address of
the allocated memory will be a multiple of alignment, which
must be a power of two and a multiple of sizeof(void *).

The obsolete function memalign() allocates size bytes and
returns a pointer to the allocated memory. The memory address
will be a multiple of boundary, which must be a power of two.

The obsolete function valloc() allocates size bytes and returns
a pointer to the allocated memory. The memory address will be a
multiple of the page size. It is equivalent to
memalign(sysconf(_SC_PAGESIZE),size).

For all three routines, the memory is not zeroed.

.
.
.

AVAILABILITY
The functions memalign() and valloc() have been available in
all Linux libc libraries. The function posix_memalign() is
available since glibc 2.1.91.


CONFORMING TO
The function valloc() appeared in 3.0 BSD. It is documented as
being obsolete in BSD 4.3, and as legacy in SUSv2. It no longer
occurs in SUSv3. The function memalign() appears in SunOS 4.1.3
but not in BSD 4.4. The function posix_memalign() comes from
POSIX 1003.1d.

Leopold Toetsch

May 30, 2003, 4:08:37 PM
to Andy Switala, perl6-i...@perl.org

Thanks to you and Steve for the docs.
I now have a test and platform code for both flavors of memalign; I'll
send it later.

leo
