[Python-Dev] The untuned tunable parameter ARENA_SIZE


Larry Hastings

Jun 1, 2017, 3:40:09 AM
to Python-Dev


When CPython's small block allocator was first merged in late February 2001, it allocated memory in gigantic chunks it called "arenas".  These arenas were a massive 256 KILOBYTES apiece.

This tunable parameter has not been touched in the intervening 16 years.  Yet CPython's memory consumption continues to grow.  By the time a current "trunk" build of CPython reaches the REPL prompt it's already allocated 16 arenas.

I propose we make the arena size larger.  By how much?  I asked Victor to run some benchmarks with arenas of 1mb, 2mb, and 4mb.  The results with 1mb and 2mb were mixed, but his benchmarks with a 4mb arena size showed measurable (>5%) speedups on ten benchmarks and no slowdowns.

What would be the result of making the arena size 4mb?
  • CPython could no longer run on a computer where at startup it could not allocate at least one 4mb contiguous block of memory.
  • CPython programs would die slightly sooner in out-of-memory conditions.
  • CPython programs would use more memory.  How much?  Hard to say.  It depends on their allocation strategy.  I suspect most of the time it would be < 3mb additional memory.  But for pathological allocation strategies the difference could be significant.  (e.g.: lots of allocs, followed by lots of frees, but the occasional object lives forever, which means that the arena it's in can never be freed.  If 1 out of every 16 256k arenas is kept alive this way, and the objects are spaced out precisely such that now it's 1 for every 4mb arena, max memory use would be the same but later stable memory use would hypothetically be 16x current)
  • Many programs would be slightly faster now and then, simply because we call malloc() 1/16 as often.
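
For reference, the proposal boils down to changing one hard-coded constant in Objects/obmalloc.c, roughly:

    /* current value in Objects/obmalloc.c */
    #define ARENA_SIZE              (256 << 10)     /* 256KB */

    /* proposed value */
    #define ARENA_SIZE              (4 << 20)       /* 4MB */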


What say you?  Vote for your favorite color of bikeshed now!


/arry

Larry Hastings

Jun 1, 2017, 3:42:56 AM
to Python-Dev



On 06/01/2017 12:38 AM, Larry Hastings wrote:
I propose we make the arena size larger.  By how much?  I asked Victor to run some benchmarks with arenas of 1mb, 2mb, and 4mb.  The results with 1mb and 2mb were mixed, but his benchmarks with a 4mb arena size showed measurable (>5%) speedups on ten benchmarks and no slowdowns.

Oh, sorry!  Meant to add: thanks, Victor, for running these benchmarks for me!

Where are my manners?!


/arry

Victor Stinner

Jun 1, 2017, 3:59:20 AM
to Larry Hastings, Python-Dev
2017-06-01 9:38 GMT+02:00 Larry Hastings <la...@hastings.org>:
> When CPython's small block allocator was first merged in late February 2001,
> it allocated memory in gigantic chunks it called "arenas". These arenas
> were a massive 256 KILOBYTES apiece.

The arena size defines the strict minimum memory usage of Python. With
256 kB, it means that the smallest memory usage is 256 kB.

> What would be the result of making the arena size 4mb?

A minimum memory usage of 4 MB. It also means that if you allocate 4
MB + 1 byte, Python will eat 8 MB from the operating system.

The GNU libc malloc uses a variable threshold to choose between sbrk()
(heap memory) and mmap(). It starts at 128 kB or 256 kB, and is then
adapted depending on the workload (I don't know exactly how).

I would prefer to have an adaptive arena size: for example, start at
256 kB and then double the arena size as memory usage grows.
The problem is that pymalloc is currently designed for a fixed arena
size. I have no idea how hard it would be to make the size vary per
allocated arena.

I have read that CPUs support "large pages" between 2 MB and 1 GB,
instead of just 4 kB. Using large pages can have a significant impact
on performance. I don't know if we can do something to help the Linux
kernel use large pages for our memory, nor how we would do that :-)
Maybe making mmap() allocations closer to large-page sizes would help
Linux merge them into a big page? (Linux has something magic to
make applications use big pages transparently.)

More generally: I'm strongly in favor of making our memory allocator
more efficient :-D

When I wrote my tracemalloc PEP 454, I counted that Python calls
malloc(), realloc() or free() 270,000 times per second on average
when running the Python test suite:
https://www.python.org/dev/peps/pep-0454/#log-calls-to-the-memory-allocator
(now I don't recall if it was really "malloc" or PyObject_Malloc, but
well, we do a lot of memory allocations and deallocations ;-))

When I analyzed the timeline of CPython master performance, I was
surprised to see that my change to PyMem_Malloc() to make it use
pymalloc was one of the most significant "optimizations" of
Python 3.6!
http://pyperformance.readthedocs.io/cpython_results_2017.html#pymalloc-allocator

CPython performance depends heavily on the performance of our
memory allocator, or at least on the performance of pymalloc (the
specialized allocator for blocks <= 512 bytes).

By the way, Naoki INADA also proposed a different idea:

"Global freepool: Many types has it’s own freepool. Sharing freepool
can increase memory and cache efficiency. Add PyMem_FastFree(void*
ptr, size_t size) to store memory block to freepool, and PyMem_Malloc
can check global freepool first."
http://faster-cpython.readthedocs.io/cpython37.html
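
A very rough sketch of the shape of that idea (all names except the
proposed PyMem_FastFree are hypothetical; none of this exists in CPython):

    #include <stddef.h>

    /* One singly-linked stack of free blocks per 8-byte size class, for
       requests of 1..512 bytes.  Assumes blocks are at least pointer-sized,
       which pymalloc's 8-byte alignment already guarantees. */
    #define N_CLASSES 64
    #define SIZE_CLASS(n) (((n) - 1) >> 3)

    typedef struct freeblock { struct freeblock *next; } freeblock;
    static freeblock *global_freepool[N_CLASSES];

    void
    PyMem_FastFree(void *ptr, size_t size)
    {
        freeblock *b = (freeblock *)ptr;
        b->next = global_freepool[SIZE_CLASS(size)];
        global_freepool[SIZE_CLASS(size)] = b;
    }

    /* Hypothetical fast path checked before falling back to the regular
       allocator; returns NULL on a freepool miss. */
    void *
    global_freepool_alloc(size_t size)
    {
        freeblock *b = global_freepool[SIZE_CLASS(size)];
        if (b != NULL)
            global_freepool[SIZE_CLASS(size)] = b->next;
        return b;
    }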

IMHO it's worth investigating this idea as well.

Victor

Larry Hastings

Jun 1, 2017, 4:09:00 AM
to Victor Stinner, Python-Dev
On 06/01/2017 12:57 AM, Victor Stinner wrote:
I would prefer to have an adaptative arena size. For example start at
256 kB and then double the arena size while the memory usage grows.
The problem is that pymalloc is currently designed for a fixed arena
size. I have no idea how hard it would be to make the size per
allocated arena.

It's not hard.  The major pain point is that it'd make the address_in_range() inline function slightly more expensive.  Currently that code has ARENA_SIZE hardcoded inside it; if the size was dynamic we'd have to look up the size of the arena every time.  This function is called every time we free a pointer, so it's done hundreds of thousands of times per second (as you point out).
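
For context, the check in question looks roughly like this (a paraphrase of address_in_range() in Objects/obmalloc.c, where pool is POOL_ADDR(p), i.e. p rounded down to its pool boundary; note how ARENA_SIZE appears as a compile-time constant):

    /* Paraphrased sketch, not the literal source.  Reading pool->arenaindex
       may hit memory pymalloc never wrote if p came from the system malloc;
       that is deliberate and harmless, but the page must be readable. */
    static bool
    address_in_range(void *p, poolp pool)
    {
        uint arenaindex = *((volatile uint *)&pool->arenaindex);
        return arenaindex < maxarenas &&
               (uintptr_t)p - arenas[arenaindex].address < ARENA_SIZE &&
               arenas[arenaindex].address != 0;
    }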

It's worth trying the experiment to see if dynamic arena sizes would make programs notably faster.  However... why not both?  Changing to 4mb arenas now is a one-line change, and on first examination seems mostly harmless, and yields an easy (if tiny) performance win.  If someone wants to experiment with dynamic arenas, they could go right ahead, and if it works well we could merge that too.


/arry

Antoine Pitrou

Jun 1, 2017, 4:21:09 AM
to pytho...@python.org
On Thu, 1 Jun 2017 00:38:09 -0700
Larry Hastings <la...@hastings.org> wrote:
> * CPython programs would use more memory. How much? Hard to say. It
> depends on their allocation strategy. I suspect most of the time it
> would be < 3mb additional memory. But for pathological allocation
> strategies the difference could be significant. (e.g: lots of
> allocs, followed by lots of frees, but the occasional object lives
> forever, which means that the arena it's in can never be freed. If
> 1 out of ever 16 256k arenas is kept alive this way, and the objects
> are spaced out precisely such that now it's 1 for every 4mb arena,
> max memory use would be the same but later stable memory use would
> hypothetically be 16x current)

Yes, this is the same kind of reason the default page size is still 4KB
on many platforms today, despite typical memory size having grown by a
huge amount. Apart from the cost of fragmentation as you mentioned,
another issue is when many small Python processes are running on a
machine: a 2MB overhead per process can compound to large numbers if
you have many (e.g. hundreds) such processes.

I would suggest we exert caution here. Small benchmarks generally have
a nice memory behaviour: not only do they not allocate a lot of memory,
but they often release it all at once after a single run. Perhaps
some of those benchmarks would even be better off if we allocated 64MB
up front and never released it :-)

Long-running applications can be less friendly than that, having
various pieces of internal state with unpredictable lifetimes (especially
when they're talking over the network with other peers which come and go).
And long-running applications are typically where Python memory usage is
a sensitive matter.

If you'd like to go that way anyway, I would suggest 1MB as a starting
point in 3.7.

> * Many programs would be slightly faster now and then, simply because
> we call malloc() 1/16 as often.

malloc() you said? Arenas are allocated using mmap() nowadays, right?

Regards

Antoine.

INADA Naoki

Jun 1, 2017, 4:26:19 AM
to Larry Hastings, Python-Dev
Hello.

AFAIK, allocating an arena doesn't eat real (physical) memory.

* On Windows, VirtualAlloc is used for arenas. A real memory page is assigned
when the page is used for the first time.
* On Linux and some other *nix, anonymous mmap is used. A real page is
assigned on first touch, like on Windows.

Arena size is more important for **freeing** memory.
Python returns memory to the system in two steps:

1. When no block in a pool is used, the pool is returned to the arena.
2. When no pool in an arena is used, the arena is returned to the system.

So a single live memory block can prevent the whole arena from being returned.


Some VMs (e.g. mono) use special APIs to return "real pages" from
allocated space.

* On Windows, VirtualFree() + VirtualAlloc() can be used to unassign pages.
* On Linux, madvise(..., MADV_DONTNEED) can be used.
* On other *nix, madvise(..., MADV_DONTNEED) + madvise(..., MADV_FREE)
can be used.

See also:

https://github.com/corngood/mono/blob/ef186403b5e95a5c95c38f1f19d0c8d061f2ac37/mono/utils/mono-mmap.c#L204-L208
(Windows)
https://github.com/corngood/mono/blob/ef186403b5e95a5c95c38f1f19d0c8d061f2ac37/mono/utils/mono-mmap.c#L410-L424
(Unix)

I think we can return not-recently-used free pools to the system in the same way.
That way, a larger arena size and better memory efficiency can both be achieved.
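
A minimal sketch of what that could look like on Linux (a hypothetical
helper, not part of obmalloc; the span stays mapped, only its physical
pages are given back):

    #include <sys/mman.h>

    #define POOL_SIZE 4096   /* page-aligned and a multiple of the page size */

    /* Hypothetical: drop the physical pages backing an unused pool.  The
       virtual address range stays reserved; touching it again faults in
       fresh zero-filled pages. */
    static void
    release_pool_pages(void *pool_addr)
    {
        madvise(pool_addr, POOL_SIZE, MADV_DONTNEED);
    }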

But I need more experiment.

Regards,

Victor Stinner

Jun 1, 2017, 4:36:43 AM
to Antoine Pitrou, Python Dev
2017-06-01 10:19 GMT+02:00 Antoine Pitrou <soli...@pitrou.net>:
> Yes, this is the same kind of reason the default page size is still 4KB
> on many platforms today, despite typical memory size having grown by a
> huge amount. Apart from the cost of fragmentation as you mentioned,
> another issue is when many small Python processes are running on a
> machine: a 2MB overhead per process can compound to large numbers if
> you have many (e.g. hundreds) such processes.
>
> I would suggest we exert caution here. Small benchmarks generally have
> a nice memory behaviour: not only they do not allocate a lot of memory a,
> but often they will release it all at once after a single run. Perhaps
> some of those benchmarks would even be better off if we allocated 64MB
> up front and never released it :-)

By the way, the "performance" benchmark suite supports different ways to
trace memory usage:

* using tracemalloc
* using /proc/pid/smaps
* using VmPeak of /proc/pid/status (max RSS memory)

I wrote the code but I didn't try it yet :-) Maybe we should check the
memory usage before deciding to change the arena size?

Victor

Victor Stinner

Jun 1, 2017, 4:39:18 AM
to INADA Naoki, Python-Dev
2017-06-01 10:23 GMT+02:00 INADA Naoki <songof...@gmail.com>:
> AFAIK, allocating arena doesn't eat real (physical) memory.
>
> * On Windows, VirtualAlloc is used for arena. Real memory page is assigned
> when the page is used first time.
> * On Linux and some other *nix, anonymous mmap is used. Real page is
> assigned when first touch, like Windows.

Memory fragmentation is also a real problem in pymalloc. I don't think
that pymalloc is designed to reduce memory fragmentation.

I know of one worst case: the Python parser, which allocates small objects
that will be freed when the parser completes, while other, longer-lived
objects are created in the meantime.
https://github.com/haypo/misc/blob/master/memory/python_memleak.py

In a perfect world, the parser should use a different memory allocator
for that. But currently, the Python API doesn't offer this level of
granularity.

Victor

Antoine Pitrou

Jun 1, 2017, 4:42:26 AM
to pytho...@python.org
On Thu, 1 Jun 2017 09:57:04 +0200
Victor Stinner <victor....@gmail.com> wrote:
>
> By the way, Naoki INADA also proposed a different idea:
>
> "Global freepool: Many types has it’s own freepool. Sharing freepool
> can increase memory and cache efficiency. Add PyMem_FastFree(void*
> ptr, size_t size) to store memory block to freepool, and PyMem_Malloc
> can check global freepool first."

This is already exactly how PyObject_Malloc() works. Really, the fast
path for PyObject_Malloc() is:

size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
pool = usedpools[size + size];
if (pool != pool->nextpool) {
    /*
     * There is a used pool for this size class.
     * Pick up the head block of its free list.
     */
    ++pool->ref.count;
    bp = pool->freeblock;
    assert(bp != NULL);
    if ((pool->freeblock = *(block **)bp) != NULL) {
        UNLOCK();
        return (void *)bp; // <- fast path!
    }


I don't think you can get much faster than that in a generic allocation
routine (unless you have a compacting GC where allocating memory is
basically bumping a single global pointer). IMHO the main advantage the
private freelists have is precisely that they're *private*, so they can
avoid a couple of conditional branches.

Regards

Antoine.

Larry Hastings

Jun 1, 2017, 4:43:15 AM
to pytho...@python.org
On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
If you'd like to go that way anyway, I would suggest 1MB as a starting
point in 3.7.

I understand the desire for caution.  But I was hoping maybe we could experiment with 4mb in trunk for a while?  We could change it to 1mb--or even 256k--before beta 1 if we get anxious.



  * Many programs would be slightly faster now and then, simply because
    we call malloc() 1/16 as often.
malloc() you said?  Arenas are allocated using mmap() nowadays, right?

malloc() and free().  See _PyObject_ArenaMalloc (etc) in Objects/obmalloc.c.

On Windows Python uses VirtualAlloc(), and I don't know what the implications are of that.


/arry

Larry Hastings

Jun 1, 2017, 4:48:01 AM
to pytho...@python.org
On 06/01/2017 01:41 AM, Larry Hastings wrote:
On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
malloc() you said?  Arenas are allocated using mmap() nowadays, right?
malloc() and free().  See _PyObject_ArenaMalloc (etc) in Objects/obmalloc.c.

Oh, sorry, I forgot how to read.  If ARENAS_USE_MMAP is on it uses mmap().  I can't figure out when or how MAP_ANONYMOUS gets set, but if I step into the _PyObject_Arena.alloc() it indeed calls _PyObject_ArenaMmap() which uses mmap().  So, huzzah!, we use mmap() to allocate our enormous 256kb arenas.


/arry

Antoine Pitrou

Jun 1, 2017, 4:54:15 AM
to pytho...@python.org
On Thu, 1 Jun 2017 01:41:15 -0700
Larry Hastings <la...@hastings.org> wrote:
> On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
> > If you'd like to go that way anyway, I would suggest 1MB as a starting
> > point in 3.7.
>
> I understand the desire for caution. But I was hoping maybe we could
> experiment with 4mb in trunk for a while? We could change it to 1mb--or
> even 256k--before beta 1 if we get anxious.

Almost nobody tests "trunk" (or "master" :-)) on production systems. At
best a couple of rare open source projects will run their test suite on
the pre-release betas, but that's all. So we are unlikely to spot
ballooning memory usage before the final release.

> >> * Many programs would be slightly faster now and then, simply because
> >> we call malloc() 1/16 as often.
> > malloc() you said? Arenas are allocated using mmap() nowadays, right?
>
> malloc() and free(). See _PyObject_ArenaMalloc (etc) in Objects/obmalloc.c.

_PyObject_ArenaMalloc should only be used if the OS doesn't support
mmap() or MAP_ANONYMOUS (see ARENAS_USE_MMAP). Otherwise
_PyObject_ArenaMmap is used.

Apparently OS X doesn't have MAP_ANONYMOUS but it has the synonymous
MAP_ANON:
https://github.com/HaxeFoundation/hashlink/pull/12

INADA Naoki

Jun 1, 2017, 4:59:19 AM
to Antoine Pitrou, Python-Dev
Hi,

As you said, I think PyObject_Malloc() is fast enough.
But PyObject_Free() is somewhat complex.

Actually, there are some freelists (e.g. tuple, dict, frame) and
they improve performance significantly.


My "global unified freelist" idea is unify them. The merit is:

* Unify _PyXxx_DebugMallocStats(). Some freelists provide
it but some doesn't.

* Unify PyXxx_ClearFreeList(). Some freelists doesn't provide
it and it may disturb returning memory to system.

* Potential better CPU cache hit ratio by unifying LRU if some
freelists has same memory block size.

This idea is partially implemented in https://github.com/methane/cpython/pull/3,
but it shows no significant difference in speed or memory usage.

Regards,


Victor Stinner

Jun 1, 2017, 5:05:44 AM
to Larry Hastings, Python Dev
2017-06-01 10:41 GMT+02:00 Larry Hastings <la...@hastings.org>:
> On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
> If you'd like to go that way anyway, I would suggest 1MB as a starting
> point in 3.7.
>
> I understand the desire for caution. But I was hoping maybe we could
> experiment with 4mb in trunk for a while? We could change it to 1mb--or
> even 256k--before beta 1 if we get anxious.

While I can't explain why in depth, I would prefer to *not* touch
the default arena size at this point.

We need more data, for example measure the memory usage on different
workloads using different arena sizes.

It's really hard to tune a memory allocator for *every* use case.

A simple enhancement would be to add an environment variable to change
the arena size at Python startup. Example: PYTHONARENASIZE=1M. If you
*know* that your application will allocate at least 2 GB, you may even
want to try PYTHONARENASIZE=1G which is more likely to use a single
large page... Such a parameter cannot be used by default: it would make
the default Python memory usage insane ;-)
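
A sketch of how such a knob might be parsed at startup (hypothetical: no
PYTHONARENASIZE variable exists in CPython, and the suffix handling is
illustrative only):

    #include <stdlib.h>

    /* Hypothetical: read PYTHONARENASIZE, accepting e.g. "262144", "1M", "1G". */
    static size_t
    arena_size_from_env(size_t default_size)
    {
        const char *s = getenv("PYTHONARENASIZE");
        char *end;
        size_t size;

        if (s == NULL || *s == '\0')
            return default_size;
        size = (size_t)strtoull(s, &end, 10);
        if (*end == 'K')
            size <<= 10;
        else if (*end == 'M')
            size <<= 20;
        else if (*end == 'G')
            size <<= 30;
        return size != 0 ? size : default_size;
    }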

Victor

INADA Naoki

Jun 1, 2017, 5:19:16 AM
to Larry Hastings, Python-Dev
> * On Linux, madvice(..., MADV_DONTNEED) can be used.

Recent Linux has MADV_FREE. It is faster than MADV_DONTNEED,
https://lwn.net/Articles/591214/

Victor Stinner

Jun 1, 2017, 5:22:25 AM
to Antoine Pitrou, Python Dev
2017-06-01 10:40 GMT+02:00 Antoine Pitrou <soli...@pitrou.net>:
> This is already exactly how PyObject_Malloc() works. (...)

Oh ok, good to know...

> IMHO the main thing the
> private freelists have is that they're *private* precisely, so they can
> avoid a couple of conditional branches.

I would like to understand how private free lists are "so much"
faster. In fact, I don't recall if someone *measured* the performance
speedup of these free lists :-)

By the way, the Linux kernel uses a "SLAB" allocator for its most
common object types, like inodes. I'm curious whether CPython would
benefit from a similar allocator for our most common object types, for
example types which already use a free list.

https://en.wikipedia.org/wiki/Slab_allocation

Victor

INADA Naoki

Jun 1, 2017, 5:25:34 AM
to Siddhesh Poyarekar, Python-Dev
Thanks for the detailed info.

But I don't think it's a big problem.
Arenas are only returned to the system by chance, so other processes
shouldn't rely on it.

And I don't propose to stop returning arenas to the system.
I just mean that per-pool MADV_DONTNEED (on parts of an arena) can reduce RSS.

If we used a very large arena, or stopped returning arenas to the system,
it could be a problem.
Regards,

On Thu, Jun 1, 2017 at 6:05 PM, Siddhesh Poyarekar <sidd...@gotplt.org> wrote:
> On Thursday 01 June 2017 01:53 PM, INADA Naoki wrote:
>> * On Linux, madvice(..., MADV_DONTNEED) can be used.
>
> madvise does not reduce the commit charge in the Linux kernel, so in
> high consumption scenarios (and where memory overcommit is disabled or
> throttled) you'll see programs dying with OOM despite the MADV_DONTNEED.
> The way we solved it in glibc was to use mprotect to drop PROT_READ and
> PROT_WRITE in blocks that we don't need when we detect that the system
> is not configured to overcommit (using /proc/sys/vm/overcommit_memory).
> You'll need to fix the protection again though if you want to reuse the
> block.
>
> Siddhesh

Thomas Wouters

Jun 1, 2017, 5:28:45 AM
to Larry Hastings, Python-Dev
On Thu, Jun 1, 2017 at 10:45 AM, Larry Hastings <la...@hastings.org> wrote:
On 06/01/2017 01:41 AM, Larry Hastings wrote:
On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
malloc() you said?  Arenas are allocated using mmap() nowadays, right?
malloc() and free().  See _PyObject_ArenaMalloc (etc) in Objects/obmalloc.c.

Oh, sorry, I forgot how to read.  If ARENAS_USE_MMAP is on it uses mmap().  I can't figure out when or how MAP_ANONYMOUS gets set,

MAP_ANONYMOUS is defined in sys/mman.h (where the system supports it), just like the other MAP_* defines.
 
but if I step into the _PyObject_Arena.alloc() it indeed calls _PyObject_ArenaMmap() which uses mmap().  So, huzzah!, we use mmap() to allocate our enormous 256kb arenas.


/arry




--
Thomas Wouters <tho...@python.org>

Hi! I'm an email virus! Think twice before sending your email to help me spread!

INADA Naoki

Jun 1, 2017, 5:39:51 AM
to Larry Hastings, Python-Dev
x86's huge page is 2MB, and some Linux systems enable the
"Transparent Huge Page" feature.

Maybe a 2MB arena size would be better for TLB efficiency,
especially for servers with massive amounts of memory.

Louie Lu

Jun 1, 2017, 5:44:59 AM
to INADA Naoki, Python-Dev
For ARENA_SIZE, would it be better to make it settable via ./configure,
rather than hard-coding it in the C files?

Antoine Pitrou

Jun 1, 2017, 5:52:20 AM
to pytho...@python.org
On Thu, 1 Jun 2017 18:37:17 +0900
INADA Naoki <songof...@gmail.com> wrote:
> x86's hugepage is 2MB.
> And some Linux enables "Transparent Huge Page" feature.
>
> Maybe, 2MB arena size is better for TLB efficiency.
> Especially, for servers having massive memory.

But, since Linux is able to merge pages transparently, we perhaps
needn't allocate large pages explicitly.

Another possible strategy is: allocate several arenas at once (using a
larger mmap() call), and use MADV_DONTNEED to relinquish individual
arenas.

Regards

Antoine.

INADA Naoki

Jun 1, 2017, 6:10:51 AM
to Victor Stinner, Antoine Pitrou, Python Dev
I thought pymalloc was a SLAB allocator.
What is the difference between a SLAB allocator and pymalloc?

Serhiy Storchaka

Jun 1, 2017, 6:14:02 AM
to pytho...@python.org
01.06.17 12:20, Victor Stinner wrote:

> 2017-06-01 10:40 GMT+02:00 Antoine Pitrou <soli...@pitrou.net>:
>> This is already exactly how PyObject_Malloc() works. (...)
>
> Oh ok, good to know...
>
>> IMHO the main thing the
>> private freelists have is that they're *private* precisely, so they can
>> avoid a couple of conditional branches.
>
> I would like to understand how private free lists are "so much"
> faster. In fact, I don't recall if someone *measured* the performance
> speedup of these free lists :-)

I measured the performance boost of adding a free list for dict keys
structures. [1] This proposition was withdrawn in favor of using
PyObject_Malloc(). The latter solution is slightly slower, but simpler.

But even private free lists are not fast enough. That is why some
functions (zip, the dict.items iterator, property getters, etc.) have private
caches for tuples, and why the FASTCALL protocol added so much speedup.

In the end we have multiple levels of free lists and caches, and every level
adds a good speedup (otherwise it wouldn't be used).

I have also found that a lot of time is spent in the dealloc functions for
tuples called before placing an object back in a free list or memory pool.
They use the trashcan mechanism to guard against stack overflow, and that is
costly compared with clearing a 1-element tuple.

[1] https://bugs.python.org/issue16465

Siddhesh Poyarekar

Jun 1, 2017, 10:54:05 AM
to INADA Naoki, Larry Hastings, Python-Dev
On Thursday 01 June 2017 01:53 PM, INADA Naoki wrote:
> * On Linux, madvice(..., MADV_DONTNEED) can be used.

madvise does not reduce the commit charge in the Linux kernel, so in
high consumption scenarios (and where memory overcommit is disabled or
throttled) you'll see programs dying with OOM despite the MADV_DONTNEED.
The way we solved it in glibc was to use mprotect to drop PROT_READ and
PROT_WRITE in blocks that we don't need when we detect that the system
is not configured to overcommit (using /proc/sys/vm/overcommit_memory).
You'll need to fix the protection again though if you want to reuse the
block.
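
A rough sketch of that glibc trick (illustrative only; len must cover
whole pages, and error handling is omitted):

    #include <stddef.h>
    #include <sys/mman.h>

    /* When overcommit is disabled, dropping all access lets the kernel
       reclaim the pages and reduces the commit charge. */
    static void
    decommit_block(void *addr, size_t len)
    {
        mprotect(addr, len, PROT_NONE);
    }

    /* The protection must be restored before the block is reused. */
    static void
    recommit_block(void *addr, size_t len)
    {
        mprotect(addr, len, PROT_READ | PROT_WRITE);
    }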

Siddhesh

Victor Stinner

Jun 1, 2017, 11:00:38 AM
to INADA Naoki, Python-Dev
It seems very complex and not portable at all to "free" part of an
arena. We already support freeing a whole arena using munmap(). That was
a huge enhancement in our memory allocator. Was the change made in Python 2.5?
I don't recall; ask Evan Jones:
http://www.evanjones.ca/memoryallocator/ :-)

I'm not sure that it's worth it to increase the arena size and try to
implement the MADV_DONTNEED / MADV_FREE thing.

Victor

Larry Hastings

Jun 1, 2017, 2:19:07 PM
to Victor Stinner, Python Dev



On 06/01/2017 02:03 AM, Victor Stinner wrote:
2017-06-01 10:41 GMT+02:00 Larry Hastings <la...@hastings.org>:
On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
If you'd like to go that way anyway, I would suggest 1MB as a starting
point in 3.7.
I understand the desire for caution. But I was hoping maybe we could experiment with 4mb in trunk for a while? We could change it to 1mb--or even 256k--before beta 1 if we get anxious.
While I fail to explain why in depth, I would prefer to *not* touch
the default arena size at this point.

We need more data, for example measure the memory usage on different
workloads using different arena sizes.

I can't argue with collecting data at this point in the process.  My thesis is simply "the correct value for this tunable parameter in 2001 is probably not the same value in 2017".  I don't mind proceeding *slowly* or gathering more data or what have you for now.  But I would like to see it change somehow between now and 3.7.0b1, because my sense is that we can get some performance for basically free by updating the value.

If ARENA_SIZE tracked Moore's Law, meaning that we doubled it every 18 months like clockwork, it'd currently be 2**10 times bigger: 256MB, and we'd be changing it to 512MB at the end of August.

(And yes, as a high school student I was once bitten by a radioactive optimizer, so these days when I'm near possible optimizations my spider-sense--uh, I mean, my optimization-sense--starts tingling.)


A simple enhancement would be to add an environment variable to change
the arena size at Python startup. Example: PYTHONARENASIZE=1M.

Implementing this would slow down address_in_range which currently compiles in arena size.  It'd be by a tiny amount, but this inline function gets called very very frequently.  It's possible this wouldn't hurt performance, but my guess is it'd offset the gains we got from larger arenas, and the net result would be no faster or slightly slower.


/arry

Larry Hastings

Jun 1, 2017, 2:46:06 PM
to pytho...@python.org


On 06/01/2017 02:20 AM, Victor Stinner wrote:
I would like to understand how private free lists are "so much" faster. In fact, I don't recall if someone *measured* the performance speedup of these free lists :-)

I have, recently, kind of by accident.  When working on the Gilectomy I turned off some freelists as they were adding "needless" complexity and presumably weren't helping performance that much.  Then I turned them back on because it turned out they really did help.

My intuition is that they help in two major ways:
  • Since they're a known size, you don't need to go through the general-case code of looking up the right spot in usedpools (etc) to get one / put one back in malloc/free.
  • The code that recycles these objects assumes that objects from its freelist are already mostly initialized, so it doesn't need to initialize them.
The really crazy one is PyFrameObjects.  The global freelist for these stores up to 200 (I think) in a stack, implemented as a simple linked list.  When CPython wants a new frame object, it takes the top one off the stack and uses it.  Where it gets crazy is: PyFrameObjects are dynamically sized, based on the number of arguments + local variables + stack + freevars + cellvars.  So the frame you pull off the free list might not be big enough.  If it isn't big enough, the code calls *realloc* on it then uses it.  This seems like such a weird approach to me.  But it's obviously a successful approach, and I've learned not to argue with success.
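
A condensed paraphrase of that path (from the freelist branch of PyFrame_New() in Objects/frameobject.c; bookkeeping and error handling trimmed):

    /* Not the literal source -- a condensed paraphrase. */
    if (free_list != NULL) {
        f = free_list;                  /* pop the top frame off the stack */
        free_list = free_list->f_back;  /* the "stack" is a linked list    */
        --numfree;
        if (Py_SIZE(f) < extras) {
            /* The recycled frame is too small for this code object: grow it. */
            f = PyObject_GC_Resize(PyFrameObject, f, extras);
            if (f == NULL)
                return NULL;
        }
        _Py_NewReference((PyObject *)f);
        /* ... then only the fields that differ per call are reinitialized ... */
    }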

p.s. Speaking of freelists, at one point Serhiy had a patch adding a freelist for single- and I think two-digit ints.  Right now the only int creation optimization we have is the array of constant "small ints"; if the int you're constructing isn't one of those, we use the normal slow allocation path with PyObject_Alloc etc.  IIRC this patch made things faster.  Serhiy, what happened to that patch?  Was it actually a bad idea, or did it just get forgotten?


/arry

Serhiy Storchaka

Jun 1, 2017, 4:19:48 PM
to pytho...@python.org
01.06.17 21:44, Larry Hastings wrote:

> p.s. Speaking of freelists, at one point Serhiy had a patch adding a
> freelist for single- and I think two-digit ints. Right now the only int
> creation optimization we have is the array of constant "small ints"; if
> the int you're constructing isn't one of those, we use the normal slow
> allocation path with PyObject_Alloc etc. IIRC this patch made things
> faster. Serhiy, what happened to that patch? Was it actually a bad
> idea, or did it just get forgotten?

The issue [1] is still open. The patches were neither applied nor rejected.
They show a speedup in microbenchmarks, but it is not large: up to 40%
for iterating over enumerate() and 5-7% for heavy integer computations
like base85 encoding or the spectral_norm benchmark.

[1] https://bugs.python.org/issue25324

Siddhesh Poyarekar

Jun 1, 2017, 5:02:26 PM
to Victor Stinner, Larry Hastings, Python-Dev
On Thursday 01 June 2017 01:27 PM, Victor Stinner wrote:
> The GNU libc malloc uses a variable threshold to choose between sbrk()
> (heap memory) or mmap(). It starts at 128 kB or 256 kB, and then is
> adapted depending on the workload (I don't know how exactly).

The threshold starts at 128K and increases whenever an mmap'd block is
freed. For example, if the program allocates 2M (which is returned
using mmap) and then frees that block, glibc malloc assumes that 2M
blocks will be needed again and optimizes that allocation by increasing
the threshold to 2M.

This works well in practice for common programs but it has been known to
cause issues in some cases, which is why there's MALLOC_MMAP_THRESHOLD_
to fix the threshold.
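
For example, an application or embedder can pin the threshold itself
(a small illustration; the 1 MB value is arbitrary):

    #include <malloc.h>

    int main(void)
    {
        /* Pin glibc's mmap threshold instead of letting it adapt.
           Equivalent to running with MALLOC_MMAP_THRESHOLD_=1048576. */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
        /* ... allocations of >= 1 MB now go straight to mmap() ... */
        return 0;
    }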

> I already read that CPU support "large pages" between 2 MB and 1 GB,
> instead of just 4 kB. Using large pages can have a significant impact
> on performance. I don't know if we can do something to help the Linux
> kernel to use large pages for our memory? I don't know neither how we
> could do that :-) Maybe using mmap() closer to large pages will help
> Linux to join them to build a big page? (Linux has something magic to
> make applications use big pages transparently.)

There's MAP_HUGETLB and friends for mmap flags, but it's generally
better to just let the kernel do this for you transparently (using
Transparent Huge Pages) by making sure that your arena allocations are
either contiguous or big enough.

Siddhesh

Victor Stinner

Jun 1, 2017, 5:27:05 PM
to Serhiy Storchaka, Python Dev
2017-06-01 22:16 GMT+02:00 Serhiy Storchaka <stor...@gmail.com>:
> The issue [1] still is open. Patches neither applied nor rejected. They
> exposes the speed up in microbenchmarks, but it is not large. Up to 40% for
> iterating over enumerate() and 5-7% for hard integer computations like
> base85 encoding or spectral_norm benchmark.
>
> [1] https://bugs.python.org/issue25324

Hum, I think that the right issue is:
http://bugs.python.org/issue24165

Victor

Larry Hastings

Jun 1, 2017, 11:23:19 PM
to pytho...@python.org

On 06/01/2017 02:50 AM, Antoine Pitrou wrote:
Another possible strategy is: allocate several arenas at once (using a
larger mmap() call), and use MADV_DONTNEED to relinquish individual
arenas.

Thus adding a *fourth* layer of abstraction over memory we get from the OS?
block -> pool -> arena -> "multi-arena" -> OS
Y'know, this might actually make things faster.  These multi-arenas could be the dynamically growing thing Victor wants to try.  We allocate 16mb, then carve it up into arenas (however big those are), then next time allocate 32mb or what have you.  Since the arenas remain a fixed size, we don't make the frequently-used code path (address_in_range) any slower.  The code to deal with the multi-arenas would add a little complexity--to an admittedly already complex allocator implementation, but then what allocator isn't complex internally?--but it'd be an infrequent code path and I bet it'd be an improvement over simply calling malloc / mmap / VirtualAlloc.  What do you think, Victor?

And to think I started this reply ironically,


/arry

Antoine Pitrou

Jun 2, 2017, 5:40:21 AM
to pytho...@python.org
On Thu, 1 Jun 2017 20:21:12 -0700
Larry Hastings <la...@hastings.org> wrote:
> On 06/01/2017 02:50 AM, Antoine Pitrou wrote:
> > Another possible strategy is: allocate several arenas at once (using a
> > larger mmap() call), and use MADV_DONTNEED to relinquish individual
> > arenas.
>
> Thus adding a *fourth* layer of abstraction over memory we get from the OS?
>
> block -> pool -> arena -> "multi-arena" -> OS

Not a layer of abstraction, just over-allocation. You would
over-allocate arenas like you over-allocate a list object's elements
storage (though probably using a different over-allocation
strategy ;-)). That would reduce the number of mmap() calls (though
not necessarily the number of munmap() or madvise() calls), and also
provide more opportunities for the kernel to allocate a large page.

But you would still handle individual arenas in the allocation code.

> Y'know, this might actually make things faster. These multi-arenas
> could be the dynamically growing thing Victor wants to try. We allocate
> 16mb, then carve it up into arenas (however big those are), then next
> time allocate 32mb or what have you.

I hope those are not the actual numbers you're intending to use ;-)
I still think that allocating more than 1 or 2MB at once would be
foolish. Remember this is data that's going to be carved up into
(tens of) thousands of small objects. Large objects eschew the small
object allocator (not to mention that third-party libraries like Numpy
may be using different allocation routines when they allocate very
large data).

Victor Stinner

Jun 2, 2017, 5:48:59 AM
to Larry Hastings, Python-Dev
I would be curious about another test: use pymalloc for objects larger
than 512 bytes. For example, allocate up to 4 KB?

In the past, we already changed the maximum size from 256 to 512 bytes to
support the most common Python objects on 64-bit platforms, since Python
objects contain many pointers: switching from 32-bit to 64-bit can
double the size of an object in the worst case.

Victor

INADA Naoki

Jun 2, 2017, 7:58:44 AM
to Victor Stinner, Python-Dev
> I would be curious of another test: use pymalloc for objects larger
> than 512 bytes. For example, allocate up to 4 KB?

Since the current pool size is 4KB and there is a pool_header in each pool,
we can't allocate a 4KB block from a pool.
And if we supported 1KB blocks, only 3KB of each 4KB pool could actually be used.
I think 512 bytes / 4KB (1/8) is a good ratio.

Do you mean increasing the pool size?

How about adding a configure option for a kind of server mode?

SMALL_REQUEST_THRESHOLD 1024 // 2x
POOL_SIZE (16*1024) // 4x
ARENA_SIZE (2*1024*1024) // 8x, and same to huge page size.

>
> In the past, we already changed the maximum size from 256 to 512 to
> support most common Python objects on 64-bit platforms. Since Python
> objects contain many pointers: switching from 32 bit to 64 bit can
> double the size of the object in the worst case.
>

Make sense.


Naoki

Tim Peters

Jun 2, 2017, 2:25:13 PM
to INADA Naoki, Python-Dev
[INADA Naoki <songof...@gmail.com>]
> ...
> Since current pool size is 4KB and there is pool_header in pool,
> we can't allocate 4KB block from pool.
> And if support 1KB block, only 3KB of 4KB can be actually used.
> I think 512 bytes / 4KB (1/8) is good ratio.
>
> Do you mean increase pool size?
>
> How about adding configure option like server-mode?
>
> SMALL_REQUEST_THRESHOLD 1024 // 2x
> POOL_SIZE (16*1024) // 4x
> ARENA_SIZE (2*1024*1024) // 8x, and same to huge page size.

While I would like to increase the pool size, it's fraught with
danger. The problem: Py_ADDRESS_IN_RANGE has to figure out whether
an "arbitrary" address is - or is not - controlled by Python's
small-object allocator. The excruciating logic is spelled out at
length in obmalloc.c.

As is, the code reads up memory near "the start" of a pool to get the
pool's belief about which arena it's in. If the memory is not in fact
controlled by pymalloc, this can be anything, even uninitialized
trash. That's fine (as explained in the comments) - but that memory
_must_ be readable.

And that's why POOL_SIZE is set to 4K: it's the smallest page size
across all the OSes Python is known to run on. If pymalloc is handed
an "arbitrary" (but valid) address, the entire OS page containing that
address is readable.

If, e.g., an OS allocates 4K pages, but Python's POOL_SIZE is 8K
(anything bigger than the OS's page allocation unit), then perhaps the
OS page at 16K is unallocated, but the page at 20K is. pymalloc sees
an address at 21K. As is, pymalloc reads the putative arena index at
20K, which is fine. But if POOL_SIZE were 8K, it would try to read
the putative arena index at 16K, and - boom! - segfault.

This failure mode would be rare but - of course - catastrophic when it occurs.

It would be nice to find a different way for pymalloc to figure out
which addresses belong to it. The excruciating Py_ADDRESS_IN_RANGE
manages to do it in small constant (independent of the number of
arenas and pools in use) time, which is its only virtue ;-)

Antoine Pitrou

Jun 2, 2017, 2:39:32 PM
to pytho...@python.org
On Fri, 2 Jun 2017 13:23:05 -0500
Tim Peters <tim.p...@gmail.com> wrote:
>
> While I would like to increase the pool size, it's fraught with
> danger.

What would be the point of increasing the pool size? Apart from being
able to allocate 4KB objects out of it, I mean.

Since 4KB+ objects are relatively uncommon (I mean we don't allocate
hundreds of thousands of them per second), I don't think it's really
worthwhile trying to have the small object allocator handle them.

> It would be nice to find a different way for pymalloc to figure out
> which addresses belong to it. The excruciating Py_ADDRESS_IN_RANGE
> manages to do it in small constant (independent of the number of
> arenas and pools in use) time, which is its only virtue ;-)

So, to sum it up, it's excruciating but fast and works reliably. Why
change it?

Regards

Antoine.

Tim Peters

Jun 2, 2017, 3:19:14 PM
to Antoine Pitrou, Python Dev
[Tim]
>> While I would like to increase the pool size, it's fraught with
>> danger.

[Antoine Pitrou <soli...@pitrou.net>]
> What would be the point of increasing the pool size? Apart from being
> able to allocate 4KB objects out of it, I mean.
>
> Since 4KB+ objects are relatively uncommon (I mean we don't allocate
> hundreds of thousands of them per second), I don't think it's really
> worthwhile trying to have the small object allocator handle them.

I don't care about "large" objects here. It's educated intuition
about speed: so long as pymalloc is working within a pool, it's
blazing fast. When it has to muck with a new pool, much slower
(compared to staying within a pool) code is required. When it has to
muck with a new arena, slower still.

So the intuition is simple: the larger a pool, the more object
operations it can handle staying within its best-case (fastest) code
paths.



>> It would be nice to find a different way for pymalloc to figure out
>> which addresses belong to it. The excruciating Py_ADDRESS_IN_RANGE
>> manages to do it in small constant (independent of the number of
>> arenas and pools in use) time, which is its only virtue ;-)

> So, to sum it up, it's excruciating but fast and works reliably. Why
> change it?

To enable using pools larger than the greatest common divisor of all
OS's native pool sizes.

There's also that Py_ADDRESS_IN_RANGE is responsible for countless
hours of head-scratching by poor developers trying to use magical
memory debuggers - it's at best unusual for code to read up memory
without caring a bit whether the memory has ever been stored to.

I should note that Py_ADDRESS_IN_RANGE is my code - this isn't a
backhanded swipe at someone else. It's always been near the top of
the "code stink" scale. So I thought I'd mention it in case someone
has been sitting on a cleaner idea but has held back because they
didn't want to offend me ;-)

Larry Hastings

Jun 2, 2017, 3:33:20 PM
to Victor Stinner, Python-Dev


On 06/02/2017 02:46 AM, Victor Stinner wrote:
I would be curious of another test: use pymalloc for objects larger
than 512 bytes. For example, allocate up to 4 KB?

In the past, we already changed the maximum size from 256 to 512 to
support most common Python objects on 64-bit platforms. Since Python
objects contain many pointers: switching from 32 bit to 64 bit can
double the size of the object in the worst case.

You've already seen Tim Peters' post about why we must leave pool size set to 4k.  Obviously this in turn means that using obmalloc for larger objects would mean more and more wasted memory.

For example, let's say we use obmalloc for allocations of 2048 bytes.  Pool size is 4096 bytes, and there's a 48-byte "pool_header" structure on the front (on 64-bit platforms, if I counted right).  So there are only 4048 bytes usable per pool.  After the first 2048 allocation, we're left with 2000 bytes at the end.  You can't use that memory for another allocation class, that's impossible given obmalloc's design.  So that 2000 bytes is just wasted.

Currently obmalloc's maximum allocation size is 512 bytes; after 7 allocations, this leaves 464 wasted bytes at the end.  Which isn't *great* exactly but it's only 11% of the overall allocated memory.

Anyway, I'm not super excited by the prospect of using obmalloc for larger objects.  There's an inverse relation between the size of allocation and the frequency of allocation.  In Python there are lots of tiny allocations, but fewer and fewer as the size increases.  (A similarly-shaped graph to what retailers call the "long tail".)  By no small coincidence, obmalloc is great at small objects, which is where we needed the help most.  Let's leave it at that.


A more fruitful endeavor might be to try one of these fancy new third-party allocators in CPython, e.g. tcmalloc, jemalloc.  Try each with both obmalloc turned on and turned off, and see what happens to performance and memory usage.  (I'd try it myself, but I'm already so far behind on watching funny cat videos.)


/arry

Larry Hastings

Jun 2, 2017, 3:36:07 PM
to pytho...@python.org

On 06/02/2017 12:09 PM, Tim Peters wrote:
I should note that Py_ADDRESS_IN_RANGE is my code - this isn't a backhanded swipe at someone else.

One minor note.  During the development of 3.6, CPython started permitting some C99-isms, including static inline functions.  Py_ADDRESS_IN_RANGE was therefore converted from a macro into a static inline function, and its new name is address_in_range.

Just so we're all on the same (4k!) page,


/arry

Larry Hastings

Jun 2, 2017, 4:07:20 PM
to pytho...@python.org

On 06/02/2017 02:38 AM, Antoine Pitrou wrote:
I hope those are not the actual numbers you're intending to use ;-)
I still think that allocating more than 1 or 2MB at once would be
foolish.  Remember this is data that's going to be carved up into
(tens of) thousands of small objects.  Large objects eschew the small
object allocator (not to mention that third-party libraries like Numpy
may be using different allocation routines when they allocate very
large data).

Honest, I'm well aware of what obmalloc does and how it works.  I bet I've spent more time crawling around in it in the last year than anybody else on the planet.  Mainly because it works so well for CPython, nobody else needed to bother!

I'm also aware, for example, that if your process grows to consume gigabytes of memory, you're going to have tens of thousands of allocated arenas.  The idea that on systems with gigabytes of memory--90%+? of current systems running CPython--we should allocate memory forever in 256kb chunks is faintly ridiculous.  I agree that we should start small, and ramp up slowly, so Python continues to run well on small computers and not allocate tons of memory for small programs.  But I also think we should ramp up *ever*, for programs that use tens or hundreds of megabytes.

Also note that if we don't touch the allocated memory, smart modern OSes won't actually commit any resources to it.  All that happens when your process allocates 1GB is that the OS changes some integers around.  It doesn't actually commit any memory to your process until you attempt to write to that memory, at which point it gets mapped in in local-page-size chunks (4k? 8k? something in that neighborhood and power-of-2 sized).  So if we allocate 32mb, and only touch the first 1mb, the other 31mb doesn't consume any real resources.  I was planning on making the multi-arena code only touch memory when it actually needs to, similarly to the way obmalloc lazily consumes memory inside an allocated pool (see the nextoffset field in pool_header), to take advantage of this ubiquitous behavior.


If I write this multi-arena code, which I might, I was thinking I'd try this approach:
  • leave arenas themselves at 256k
  • start with a 1MB multi-arena size
  • every time I allocate a new multi-arena, multiply the size of the next multi-arena by 1.5 (rounding up to 256k each time)
  • every time I free a multi-arena, divide the size of the next multi-arena by 2 (rounding up to 256k each time)
  • if allocation of a multi-arena fails, use a binary search algorithm to allocate the largest multi-arena possible (rounding up to 256k at each step)
  • cap the size of multi arenas at, let's say, 32mb
So multi-arenas would be 1mb, 1.5mb, 2.25mb, 3.5mb (round up!), etc.
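
A sketch of that growth policy, just to pin down the arithmetic (hypothetical code, not CPython; clamping the shrink at 1mb is my own assumption):

    #include <stddef.h>

    #define ARENA_SIZE      (256 * 1024)
    #define MULTI_ARENA_MIN (1024 * 1024)        /* starting size: 1mb */
    #define MULTI_ARENA_MAX (32 * 1024 * 1024)   /* cap: 32mb          */

    static size_t next_multi_arena_size = MULTI_ARENA_MIN;

    /* Round up to a whole number of 256k arenas. */
    static size_t
    round_up_to_arena(size_t n)
    {
        return (n + ARENA_SIZE - 1) / ARENA_SIZE * ARENA_SIZE;
    }

    /* After successfully mapping a multi-arena: grow the next one by 1.5x. */
    static void
    on_multi_arena_allocated(void)
    {
        next_multi_arena_size = round_up_to_arena(next_multi_arena_size * 3 / 2);
        if (next_multi_arena_size > MULTI_ARENA_MAX)
            next_multi_arena_size = MULTI_ARENA_MAX;
    }

    /* After freeing a multi-arena: halve the next one. */
    static void
    on_multi_arena_freed(void)
    {
        next_multi_arena_size = round_up_to_arena(next_multi_arena_size / 2);
        if (next_multi_arena_size < MULTI_ARENA_MIN)
            next_multi_arena_size = MULTI_ARENA_MIN;
    }

Starting from 1mb, the grow path reproduces the 1mb, 1.5mb, 2.25mb, 3.5mb sequence above.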


Fun fact: Python allocates 16 arenas at the start of the program, just to initialize obmalloc.  That consumes 4mb of memory.  With the above multi-arena approach, that'd allocate the first three multi-arenas, pre-allocating 19 arenas, leaving 3 unused.  It's *mildly* tempting to make the first multi-arena be 4mb, just so this is exactly right-sized, but... naah.


/arry

Antoine Pitrou

Jun 3, 2017, 6:09:10 AM
to pytho...@python.org
On Fri, 2 Jun 2017 12:31:19 -0700
Larry Hastings <la...@hastings.org> wrote:
>
> Anyway, I'm not super excited by the prospect of using obmalloc for
> larger objects. There's an inverse relation between the size of
> allocation and the frequency of allocation. In Python there are lots of
> tiny allocations, but fewer and fewer as the size increases. (A
> similarly-shaped graph to what retailers call the "long tail".) By no
> small coincidence, obmalloc is great at small objects, which is where we
> needed the help most. Let's leave it at that.

+1 to that and nice explanation.

> A more fruitful endeavor might be to try one of these fancy new
> third-party allocators in CPython, e.g. tcmalloc, jemalloc. Try each
> with both obmalloc turned on and turned off, and see what happens to
> performance and memory usage. (I'd try it myself, but I'm already so
> far behind on watching funny cat videos.)

We should lobby for a ban on funny cat videos so that you spend more
time on CPython.

Regards

Antoine.

Tim Peters

Jun 3, 2017, 10:48:22 PM
to Python Dev
For fun, let's multiply everything by 256:

- A "pool" becomes 1 MB.
- An "arena" becomes 64 MB.

As briefly suggested before, then for any given size class a pool
could pass out hundreds of times more objects before needing to fall
back on the slower code creating new pools or new arenas.

As an added bonus, programs would finish much sooner due to the flurry
of segfaults from Py_ADDRESS_IN_RANGE ;-)

But greatly increasing the pool size also makes a different
implementation of that much more attractive: an obvious one. That
is, obmalloc could model its address space with a bit vector, one bit
per pool-aligned address. For a given address, shift it right by 20
bits (divide by 1MB) and use what remains as the bit vector index. If
the bit is set, obmalloc manages that MB, else (or if the bit address
is out of the vector's domain) it doesn't. The system page size would
become irrelevant to its operation, and it would play nice with
magical memory debuggers (it would never access memory obmalloc hadn't
first allocated and initialized itself).

A virtual address space span of a terabyte could hold 1M pools, so
would "only" need a 1M/8 = 128KB bit vector. That's minor compared to
a terabyte (one bit per megabyte).

Of course using a bit per 4KB (the current pool size) is less
attractive - by a factor of 256. Which is why that wasn't even tried.

Note that trying to play the same trick with arenas instead would be
at best more complicated. The system calls can't be relied on to
return arena-aligned _or_ pool-aligned addresses. obmalloc itself
forces pool-alignment of pool base addresses, by (if necessary)
ignoring some number of the leading bytes in an arena. That makes
useful arithmetic on pool addresses uniform and simple.
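
A minimal sketch of the membership test Tim describes, assuming 1 MB pools
and a bitmap that only covers the span of addresses actually handed out for
arenas so far (all names are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    #define POOL_SHIFT 20                 /* 1 MB pools */

    /* Maintained by the arena-allocation code as arenas come and go. */
    static uintptr_t first_pool_index;    /* lowest covered address >> POOL_SHIFT */
    static size_t    covered_pools;       /* number of bits in the bitmap         */
    static uint8_t  *pool_bitmap;         /* 1 bit per 1 MB pool in the span      */

    static int
    address_in_range(const void *p)
    {
        /* Unsigned wrap-around makes addresses below the span fail too. */
        uintptr_t i = ((uintptr_t)p >> POOL_SHIFT) - first_pool_index;
        if (i >= covered_pools)
            return 0;
        return (pool_bitmap[i >> 3] >> (i & 7)) & 1;
    }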

Antoine Pitrou

Jun 4, 2017, 4:20:14 AM
to pytho...@python.org
On Sat, 3 Jun 2017 21:46:09 -0500
Tim Peters <tim.p...@gmail.com> wrote:
>
> A virtual address space span of a terabyte could hold 1M pools, so
> would "only" need a 1M/8 = 128KB bit vector. That's minor compared to
> a terabyte (one bit per megabyte).

The virtual address space currently supported by x86-64 is 48 bits
wide (spanning an address space of 2**48 bytes, that is 256 TB), so you
would need a 2**(48-20) bits vector, i.e. 256 Mbits or 32 MB.

Note Intel has plans to extend the virtual address space to 2**57 bytes:
https://www.phoronix.com/scan.php?page=news_item&px=Intel-5-Level-Paging

> The system calls can't be relied on to
> return arena-aligned _or_ pool-aligned addresses.

Unless using mmap() for pools with a size equal to or an exact divisor
of the system's page size, that is ;-)

Regards

Antoine.

Tim Peters

Jun 4, 2017, 10:48:30 AM
to Antoine Pitrou, Python Dev
[Tim]
>> A virtual address space span of a terabyte could hold 1M pools, so
>> would "only" need a 1M/8 = 128KB bit vector. That's minor compared to
>> a terabyte (one bit per megabyte).

[Antoine]
> The virtual address space currently supported by x86-64 is 48 bits
> wide (spanning an address space of 2**48 bytes, that is 256 TB), so you
> would need a 2**(48-20) bits vector, i.e. 256 Mbits or 32 MB.
>
> Note Intel has plans to extend the virtual address space to 2**57 bytes:
> https://www.phoronix.com/scan.php?page=news_item&px=Intel-5-Level-Paging

Fill in the blanks ;-) There's only a need for the bit vector to
cover the range of addresses _actually returned_ by the system for
arenas allocated so far (we're only trying to identify the memory
obmalloc _does_ control) . I didn't spell that out, but it was
implicit in glosses like "(or if the bit address is out of the
vector's domain)". That is, it's up to the bit vector implementation
to intelligently represent what's almost always going to be a
relatively tiny slice of a theoretically massive address space.

So my observation about a terabyte is to be read as applying to a case
in which obmalloc has _actually allocated_ arenas spanning a terabyte
of address space.(2**14 arenas of 64MB each, and if the system is kind
enough to keep them contiguous). Since that's about 100 times more
address space than any Python program I've run actually needed, it's a
bare minimum for a thought experiment.

Antoine Pitrou

Jun 4, 2017, 1:21:36 PM
to pytho...@python.org
On Sun, 4 Jun 2017 09:46:10 -0500
Tim Peters <tim.p...@gmail.com> wrote:
> [Tim]
> >> A virtual address space span of a terabyte could hold 1M pools, so
> >> would "only" need a 1M/8 = 128KB bit vector. That's minor compared to
> >> a terabyte (one bit per megabyte).
>
> [Antoine]
> > The virtual address space currently supported by x86-64 is 48 bits
> > wide (spanning an address space of 2**48 bytes, that is 256 TB), so you
> > would need a 2**(48-20) bits vector, i.e. 256 Mbits or 32 MB.
> >
> > Note Intel has plans to extend the virtual address space to 2**57 bytes:
> > https://www.phoronix.com/scan.php?page=news_item&px=Intel-5-Level-Paging
>
> Fill in the blanks ;-) There's only a need for the bit vector to
> cover the range of addresses _actually returned_ by the system for
> arenas allocated so far (we're only trying to identify the memory
> obmalloc _does_ control) . I didn't spell that out, but it was
> implicit in glosses like "(or if the bit address is out of the
> vector's domain)". That is, it's up to the bit vector implementation
> to intelligently represent what's almost always going to be a
> relatively tiny slice of a theoretically massive address space.

True. That works if the operating system doesn't go too wild in
address space randomization, though ;-)

Regards

Antoine.

Tim Peters

Jun 4, 2017, 2:52:35 PM
to Antoine Pitrou, Python Dev
[Tim]
>> ... That is, it's up to the bit vector implementation
>> to intelligently represent what's almost always going to be a
>> relatively tiny slice of a theoretically massive address space.

[Antoine]
> True. That works if the operating system doesn't go too wild in
> address space randomization, though ;-)

I don't know that it's a real problem, but if it is I'm sure you know
of efficient ways to implement sparse sets (for example, picture
Python's set implementation, but slimmed down to handle just C uint64
members).

I was hoping to spur a discussion of much higher level issues. I bet
Larry was too ;-)

Tim Peters

Jun 4, 2017, 4:20:42 PM
to Larry Hastings, Python-Dev
[Larry Hastings <la...@hastings.org>]
> ...
> Yet CPython's memory consumption continues to grow. By the time a current
> "trunk" build of CPython reaches the REPL prompt it's already allocated 16
> arenas.

I'd be surprised if that's true ;-) The first time `new_arena()` is
called, it allocates space for a vector of 16 (INITIAL_ARENA_OBJECTS)
`arena_object` structs. Those are tiny, and hold bookkeeping info for
the actual arenas, none of which are allocated at first.
`new_arena()` asks the system for memory for just one arena (and 15 of
its 16 initial arena_object structs remain unused, awaiting future
calls).

At the time, 16 wasn't picked because it was the minimum Python
needed, but because it was the _most_ the bulk of a variety of casual
Python scripts needed to complete.

It's certainly gotten worse on a 64-bit box. Here on a 64-bit Win10
under Python 3.6.1 upon reaching the interactive prompt in a DOS box
(set envar PYTHONMALLOCSTATS to get this kind of output whenever
`new_arena()` is called):

# arenas allocated total = 14
# arenas reclaimed = 5
# arenas highwater mark = 9
# arenas allocated current = 9
9 arenas * 262144 bytes/arena = 2,359,296

# bytes in allocated blocks = 2,166,656
# bytes in available blocks = 140,104
0 unused pools * 4096 bytes = 0
# bytes lost to pool headers = 27,648
# bytes lost to quantization = 24,888
# bytes lost to arena alignment = 0
Total = 2,359,296

So at most 9 arenas ("highwater mark") were ever simultaneously allocated.

Change the arena size to 64MB, and most programs would stop with the
first arena allocated ;-)

Gregory P. Smith

unread,
Jun 5, 2017, 4:56:26 PM6/5/17
to Larry Hastings, Victor Stinner, Python-Dev
FYI - in CPython, using a different malloc instead of CPython's own obmalloc effectively takes only ~three lines of code added somewhere Py_InitializeEx()-ish, thanks to the allocator API that was added for tracemalloc (PEP 445):

    PyMemAllocatorEx pma;  /* PyMemAllocatorEx is the PEP 445 struct name since 3.5 */
    ...
    /* fetch the "raw" allocator (thin wrappers over malloc/realloc/free)... */
    PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &pma);
    /* ...and install it as the object allocator, bypassing obmalloc */
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &pma);

That sets the object allocator (normally obmalloc) to be the "system" malloc, which I assume your alternate malloc (tcmalloc, jemalloc?, etc.) is overriding.
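
If it helps, here's the same idea spelled out against the PEP 445 struct
(PyMemAllocatorEx), pointing the object domain straight at the C library
allocator.  An untested sketch; the zero-size special-casing mirrors what
CPython's own raw allocator does, since callers treat a NULL return as
out-of-memory:

    #include <Python.h>
    #include <stdlib.h>

    /* Thin adapters: the PEP 445 slots take a leading context pointer. */
    static void *
    sys_malloc(void *ctx, size_t size)
    {
        return malloc(size ? size : 1);   /* malloc(0) may return NULL */
    }

    static void *
    sys_calloc(void *ctx, size_t nelem, size_t elsize)
    {
        if (nelem == 0 || elsize == 0) {
            nelem = 1;
            elsize = 1;
        }
        return calloc(nelem, elsize);
    }

    static void *
    sys_realloc(void *ctx, void *ptr, size_t new_size)
    {
        return realloc(ptr, new_size ? new_size : 1);
    }

    static void
    sys_free(void *ctx, void *ptr)
    {
        free(ptr);
    }

    static void
    bypass_obmalloc(void)
    {
        PyMemAllocatorEx alloc = {
            NULL, sys_malloc, sys_calloc, sys_realloc, sys_free
        };
        PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);
    }

It has to be installed before the first object allocation (i.e. before
Py_Initialize()), or blocks already handed out by obmalloc would end up
being free()d by the wrong allocator.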

As to where exactly, I'd have to walk through the code... I see that it was just refactored beyond what I used to recognize, as part of cleaning up interpreter startup.  (yay!) :)

For CPython at large, I don't want us to be in the business of shipping a malloc implementation (any more than we already do). But it does seem worth making what we've got tunable without having to recompile.

I liked the idea of an environment variable for setting the arena size. But there are likely more knobs that could be offered up for tuning by people deploying applications who have tested things to determine what is best for their specific needs.

A note about OS/HW page sizes: while hugepages and transparent hugepages can be a large performance win due to fewer TLB cache misses, they don't fit every application. They can be painful for anyone who depends on a fork()'ed-workers application model, because a copy-on-write of an entire 2MB or 1GB page triggered by a single refcount update is... costly.

-gps

Larry Hastings

unread,
Jun 5, 2017, 6:57:40 PM6/5/17
to pytho...@python.org

On 06/04/2017 01:18 PM, Tim Peters wrote:
> [Larry Hastings <la...@hastings.org>]
>> ...
>> Yet CPython's memory consumption continues to grow.  By the time a current
>> "trunk" build of CPython reaches the REPL prompt it's already allocated 16
>> arenas.
>
> I'd be surprised if that's true ;-)  The first time `new_arena()` is
> called, it allocates space for a vector of 16 (INITIAL_ARENA_OBJECTS)
> `arena_object` structs.  Those are tiny, and hold bookkeeping info for
> the actual arenas, none of which are allocated at first.

Oh!  I thought it also allocated the arenas themselves, in a loop.  I thought I saw that somewhere.  Happy to be proved wrong...


> So at most 9 arenas ("highwater mark") were ever simultaneously allocated.

... though not completely off-base.



On 06/04/2017 11:50 AM, Tim Peters wrote:
> I was hoping to spur a discussion of much higher level issues.  I bet
> Larry was too ;-)

Actually I was hoping everyone would just tell me how right I was and thank me for my profound insights.


/arry

Tim Peters

unread,
Jun 5, 2017, 11:13:08 PM6/5/17
to Larry Hastings, Python Dev
[Larry]
> ...
> Oh! I thought it also allocated the arenas themselves, in a loop. I
> thought I saw that somewhere. Happy to be proved wrong...

There is a loop in `new_arena()`, but it doesn't do what a casual
glance may assume it's doing ;-) It's actually looping over the
newly-allocated teensy arena descriptor structs, linking them in to a
freelist and recording that they're _not_ (yet) associated with any
address space.
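
For anyone who'd rather see it than read about it, a simplified paraphrase
of what that amounts to (not the actual obmalloc source -- the real thing
lives in Objects/obmalloc.c and does more bookkeeping when it grows the
`arenas` vector):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct arena_desc {
        uintptr_t address;             /* 0: no arena memory attached yet */
        struct arena_desc *nextarena;  /* freelist link */
        /* the real arena_object also tracks pools, free counts, etc. */
    } arena_desc;

    static arena_desc *unused_arena_objects = NULL;

    /* Grow the descriptor vector from `oldcount` to `newcount` entries and
     * push only the new, still-unassociated descriptors onto the freelist.
     * This only ever happens when the freelist is empty, so there are no
     * stale links into the old vector to fix up after realloc(). */
    static arena_desc *
    grow_arena_descriptors(arena_desc *arenas, size_t oldcount, size_t newcount)
    {
        arena_desc *vec;
        size_t i;

        vec = (arena_desc *)realloc(arenas, newcount * sizeof(*vec));
        if (vec == NULL)
            return NULL;
        for (i = oldcount; i < newcount; ++i) {
            vec[i].address = 0;        /* bookkeeping only; no memory yet */
            vec[i].nextarena = (i < newcount - 1) ? &vec[i + 1] : NULL;
        }
        unused_arena_objects = &vec[oldcount];
        return vec;
    }

The arena memory itself only shows up later, when a descriptor is popped
off that list and handed a fresh ARENA_SIZE chunk from the system.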


[Tim]
>> So at most 9 arenas ("highwater mark") were ever simultaneously allocated [by the
>> time the REPL prompt appeared in a 64-bit 3.6.1]..

> ... though not completely off-base.

Yes, 9 is in the ballpark of 16.

>> I was hoping to spur a discussion of much higher level issues. I bet
>> Larry was too ;-)

> Actually I was hoping everyone would just tell me how right I was and thank
> me for my profound insights.

Thank you! It should be thought about again.

I think _some_ increase of arena size should be a no-brainer, but I
don't expect it to help a lot. For reasons already partly explained,
I expect we'd get much better bang for the buck by increasing the pool
size:

- Roughly speaking, we bash into a slows-the-code pool boundary 64
times as frequently as an arena boundary. If the arena size is
increased, that ratio only gets worse.

- On 64-bit boxes the bytes lost to pool headers increased, but the
pool size did not. Thus we're guaranteed to "waste" a higher
_percentage_ of allocated bytes than we did on a 32-bit box.

- The small object threshold also doubled. Generally (not always),
the bytes lost to quantization increase the larger the size class.
For example, for the largest size class now, on a 64-bit box we can
only fit 7 512-byte objects into the 4096 - 48 = 4048 bytes that
remain in a pool. So, in addition to the 48 pool header bytes, we
_also_ lose the 464 leftover bytes. So we've lost 512 of the 4096
bytes: 12.5%. That's certainly not typical, but in any case
quantization losses as percentage of total bytes decrease the larger
the pool.
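
For anyone who wants to eyeball that across all the size classes, here's a
quick standalone check (assuming the current constants: 4096-byte pools, a
48-byte pool header on a 64-bit build, and 8-byte size-class spacing up to
the 512-byte small-object threshold):

    #include <stdio.h>

    int
    main(void)
    {
        const unsigned pool_size = 4096;  /* POOL_SIZE */
        const unsigned header = 48;       /* approx. pool header, 64-bit build */
        unsigned size;

        for (size = 8; size <= 512; size += 8) {
            unsigned nblocks = (pool_size - header) / size;
            unsigned lost = pool_size - nblocks * size;  /* header + leftover */
            printf("%3u-byte class: %3u blocks/pool, %4u bytes lost (%.1f%%)\n",
                   size, nblocks, lost, 100.0 * lost / pool_size);
        }
        return 0;
    }

The 512-byte row reproduces the 12.5% above; smaller classes generally
(not always) lose quite a bit less.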

Alas, I haven't thought of a plausibly practical way to replace
Py_ADDRESS_IN_RANGE unless the pool size increases a whole frickin'
lot :-(

Chris Barker

unread,
Jun 6, 2017, 6:37:52 PM6/6/17
to Tim Peters, Python Dev
On Mon, Jun 5, 2017 at 8:10 PM, Tim Peters <tim.p...@gmail.com> wrote:
> [Tim]
> >> So at most 9 arenas ("highwater mark") were ever simultaneously allocated [by the
> >> time the REPL prompt appeared in a 64-bit 3.6.1]..
>
> > ... though not completely off-base.
>
> Yes, 9 is in the ballpark of 16.

 
> I think _some_ increase of arena size should be a no-brainer,

Maybe big enough to hold a bare, just-started-up interpreter?
 
> but I
> don't expect it to help a lot.

I was wondering about that -- in my experience with making re-sizable numpy arrays (not the same use-case, I know!), I found that allocating memory wasn't all that slow, really. In that use case, if you re-allocated every time you added a single element, it was dog-slow. But once you re-allocated no more than about every 10 appends or so, it started to make little difference how much you over-allocated.

In this case, I'm thinking that as long as there is a not-tiny arena, it just isn't going to make that much difference.

But only profiling will tell us.

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov