What say you? Vote for your favorite color of bikeshed now!
/arry
I propose we make the arena size larger. By how much? I asked Victor to run some benchmarks with arenas of 1 MB, 2 MB, and 4 MB. The results with 1 MB and 2 MB were mixed, but his benchmarks with a 4 MB arena size showed measurable (>5%) speedups on ten benchmarks and no slowdowns.
The arena size defines the strict minimum memory usage of Python. With
256 kB arenas, the smallest possible memory usage is 256 kB.
> What would be the result of making the arena size 4mb?
A minimum memory usage of 4 MB. It also means that if you allocate 4
MB + 1 byte, Python will eat 8 MB from the operating system.
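To make the rounding concrete, here is a tiny illustrative helper (hypothetical, not CPython code) computing how many whole arenas a given allocation footprint costs:

    /* Hypothetical illustration: reserved memory grows in whole arenas. */
    #include <stdio.h>

    #define ARENA_SIZE (4u * 1024u * 1024u)   /* assumed 4 MB arena */

    static size_t arenas_needed(size_t nbytes)
    {
        /* Round up: even 1 extra byte costs a whole new arena. */
        return (nbytes + ARENA_SIZE - 1) / ARENA_SIZE;
    }

    int main(void)
    {
        printf("%zu\n", arenas_needed(4u * 1024u * 1024u));      /* 1 arena  = 4 MB */
        printf("%zu\n", arenas_needed(4u * 1024u * 1024u + 1));  /* 2 arenas = 8 MB */
        return 0;
    }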
The GNU libc malloc uses a variable threshold to choose between sbrk()
(heap memory) and mmap(). It starts at 128 kB or 256 kB, and is then
adjusted depending on the workload (I don't know exactly how).
I would prefer to have an adaptive arena size. For example, start at
256 kB and then double the arena size as memory usage grows. The
problem is that pymalloc is currently designed for a fixed arena
size. I have no idea how hard it would be to make the size vary per
allocated arena.
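A minimal sketch of such a doubling policy (the 4 MB cap is an assumption; pymalloc would additionally have to record each arena's size, e.g. in its arena_object):

    #include <stddef.h>

    /* Sketch only: adaptive arena sizing as described above.
     * The 4 MB cap is an assumed bound, not anything in obmalloc.c. */
    static size_t next_arena_size(size_t current)
    {
        const size_t max_arena = 4u * 1024u * 1024u;   /* assumed cap */
        return (current < max_arena) ? current * 2 : max_arena;
    }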
I already read that CPUs support "large pages" between 2 MB and 1 GB,
instead of just 4 kB. Using large pages can have a significant impact
on performance. I don't know if we can do something to help the Linux
kernel use large pages for our memory, nor how we could do that :-)
Maybe making our mmap() calls closer to the large page size will help
Linux coalesce them into a big page? (Linux has something magic to
make applications use big pages transparently.)
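For what it's worth, the "something magic" on Linux is transparent huge pages, and a process can at least hint at it. A minimal sketch, assuming Linux (MADV_HUGEPAGE is only advisory; the kernel is free to ignore it):

    /* Sketch (not CPython code): ask Linux to back a mapping with
     * transparent huge pages.  Assumes Linux with THP enabled. */
    #include <stddef.h>
    #include <sys/mman.h>

    static void *alloc_arena_with_hugepage_hint(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
    #ifdef MADV_HUGEPAGE
        /* Hint: merge this mapping's pages into 2 MB huge pages. */
        (void)madvise(p, size, MADV_HUGEPAGE);
    #endif
        return p;
    }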
More generally: I'm strongly in favor of making our memory allocator
more efficient :-D
When I wrote my tracemalloc PEP 454, I counted that Python calls
malloc(), realloc() or free() 270,000 times per second on average
when running the Python test suite:
https://www.python.org/dev/peps/pep-0454/#log-calls-to-the-memory-allocator
(Now I don't recall if it was really malloc() or PyObject_Malloc(), but
anyway, we do a lot of memory allocations and deallocations ;-))
When I analyzed the timeline of CPython master performance, I was
surprised to see that my change making PyMem_Malloc() use pymalloc
was one of the most significant optimizations of Python 3.6!
http://pyperformance.readthedocs.io/cpython_results_2017.html#pymalloc-allocator
CPython performance heavily depends on the performance of our
memory allocator, or at least on the performance of pymalloc (the
specialized allocator for blocks <= 512 bytes).
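For reference, here is a condensed sketch of that dispatch; SMALL_REQUEST_THRESHOLD is the real constant from Objects/obmalloc.c, while pymalloc_small_alloc() is a hypothetical stand-in for the pool logic:

    #include <stdlib.h>

    #define SMALL_REQUEST_THRESHOLD 512   /* real obmalloc.c constant */

    void *pymalloc_small_alloc(size_t nbytes);   /* hypothetical: pool allocator */

    void *sketch_object_malloc(size_t nbytes)
    {
        if (nbytes != 0 && nbytes <= SMALL_REQUEST_THRESHOLD)
            /* small request: served from pymalloc's pools inside arenas */
            return pymalloc_small_alloc(nbytes);
        /* large request: falls through to the system allocator */
        return malloc(nbytes);
    }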
By the way, Naoki INADA also proposed a different idea:
"Global freepool: Many types has it’s own freepool. Sharing freepool
can increase memory and cache efficiency. Add PyMem_FastFree(void*
ptr, size_t size) to store memory block to freepool, and PyMem_Malloc
can check global freepool first."
http://faster-cpython.readthedocs.io/cpython37.html
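A rough sketch of how I read that proposal (only the PyMem_FastFree() signature comes from the quote; the 8-byte size classes and all other names are assumptions for illustration):

    #include <stdlib.h>

    /* Per-size-class global freepools, checked before the regular
     * allocator. */
    #define MAX_FAST_SIZE 512
    #define NUM_CLASSES   (MAX_FAST_SIZE / 8)
    #define CLASS_OF(sz)  (((sz) - 1) / 8)

    static void *freepool[NUM_CLASSES];   /* heads of singly-linked free lists */

    void PyMem_FastFree(void *ptr, size_t size)   /* proposed API */
    {
        if (size < sizeof(void *) || size > MAX_FAST_SIZE) {
            free(ptr);                     /* too small/large to pool */
            return;
        }
        *(void **)ptr = freepool[CLASS_OF(size)];   /* link block in */
        freepool[CLASS_OF(size)] = ptr;
    }

    void *PyMem_Malloc_sketch(size_t size)
    {
        if (size >= sizeof(void *) && size <= MAX_FAST_SIZE) {
            size_t c = CLASS_OF(size);
            void *p = freepool[c];
            if (p != NULL) {               /* hit: pop the head block */
                freepool[c] = *(void **)p;
                return p;
            }
            /* miss: allocate the full class size so the block can be
             * reused later for any request mapping to this class */
            return malloc((c + 1) * 8);
        }
        return malloc(size);               /* not poolable: plain malloc */
    }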
IMHO it's worth investigating this change as well.
Victor
> I would prefer to have an adaptive arena size. For example, start at 256 kB and then double the arena size as memory usage grows. The problem is that pymalloc is currently designed for a fixed arena size. I have no idea how hard it would be to make the size vary per allocated arena.
This is already exactly how PyObject_Malloc() works. Really, the fast
path for PyObject_Malloc() is:
    size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
    pool = usedpools[size + size];
    if (pool != pool->nextpool) {
        /*
         * There is a used pool for this size class.
         * Pick up the head block of its free list.
         */
        ++pool->ref.count;
        bp = pool->freeblock;
        assert(bp != NULL);
        if ((pool->freeblock = *(block **)bp) != NULL) {
            UNLOCK();
            return (void *)bp; // <- fast path!
        }
    }
I don't think you can get much faster than that in a generic allocation
routine (unless you have a compacting GC where allocating memory is
basically bumping a single global pointer). IMHO the main advantage the
private freelists have is precisely that they're *private*, so they can
avoid a couple of conditional branches.
Regards
Antoine.
If you'd like to go that way anyway, I would suggest 1MB as a starting point in 3.7.
* Many programs would be slightly faster now and then, simply because we call malloc() 1/16 as often.

malloc() you said? Arenas are allocated using mmap() nowadays, right?
On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
> malloc() you said? Arenas are allocated using mmap() nowadays, right?
malloc() and free(). See _PyObject_ArenaMalloc (etc) in Objects/obmalloc.c.
As you said, I think PyObject_Malloc() is fast enough.
But PyObject_Free() is somewhat complex.
Actually, there are some freelists (e.g. tuple, dict, frame) and
they improve performance significantly.
My "global unified freelist" idea is unify them. The merit is:
* Unify _PyXxx_DebugMallocStats(). Some freelists provide
it but some doesn't.
* Unify PyXxx_ClearFreeList(). Some freelists doesn't provide
it and it may disturb returning memory to system.
* Potential better CPU cache hit ratio by unifying LRU if some
freelists has same memory block size.
This idea is partially implemented in https://github.com/methane/cpython/pull/3
but there is no significant difference in speed or memory usage.
Regards,
On 06/01/2017 01:41 AM, Larry Hastings wrote:
> On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
>> malloc() you said? Arenas are allocated using mmap() nowadays, right?
> malloc() and free(). See _PyObject_ArenaMalloc (etc) in Objects/obmalloc.c.
Oh, sorry, I forgot how to read. If ARENAS_USE_MMAP is on, it uses mmap(). I can't figure out when or how MAP_ANONYMOUS gets set,
but if I step into _PyObject_Arena.alloc() it indeed calls _PyObject_ArenaMmap(), which uses mmap(). So, huzzah!, we use mmap() to allocate our enormous 256 kB arenas.
/arry
I measured the performance boost of adding a free list for dict keys
structures. [1] That proposition was withdrawn in favor of using
PyObject_Malloc(). The latter solution is slightly slower, but simpler.
But even private free lists are not fast enough. That is why some
functions (zip, the dict.items iterator, property getters, etc.) have
private caches for tuples, and why the FASTCALL protocol added so much speedup.
In the end we have multiple levels of free lists and caches, and every level
adds a good speedup (otherwise it wouldn't be used).
I have also found that much time is spent in the dealloc functions for
tuples, called before placing an object back in a free list or memory
pool. They use the trashcan mechanism to guard against stack overflow,
and it is costly in comparison with clearing a 1-element tuple.
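For context, a dealloc wrapped in the trashcan machinery of that era looks roughly like this (a sketch with a hypothetical type; the macros bump a nesting counter and, past a depth limit, defer the deallocation to a pending list, which is where the cost comes from):

    #include <Python.h>

    /* Sketch: the Py_TRASHCAN_SAFE_* spelling used in CPython 3.x of
     * that era.  The bookkeeping the macros do dwarfs the work of
     * clearing a 1-element tuple. */
    static void
    mytype_dealloc(PyObject *op)       /* hypothetical type */
    {
        Py_TRASHCAN_SAFE_BEGIN(op)
        /* ... Py_XDECREF() the contained objects, then free op ... */
        Py_TRASHCAN_SAFE_END(op)
    }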
[1] https://bugs.python.org/issue16465
2017-06-01 10:41 GMT+02:00 Larry Hastings <la...@hastings.org>:
> On 06/01/2017 01:19 AM, Antoine Pitrou wrote:
>> If you'd like to go that way anyway, I would suggest 1MB as a starting point in 3.7.
> I understand the desire for caution. But I was hoping maybe we could experiment with 4 MB in trunk for a while? We could change it to 1 MB--or even 256 kB--before beta 1 if we get anxious.

While I fail to explain why in depth, I would prefer to *not* touch the default arena size at this point. We need more data, for example measuring memory usage on different workloads with different arena sizes.
A simple enhancement would be to add an environment variable to change the arena size at Python startup. Example: PYTHONARENASIZE=1M.
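A minimal sketch of what parsing such a variable could look like at startup (PYTHONARENASIZE is only a proposal; the helper name and the accepted suffixes are my assumptions):

    #include <stdlib.h>

    /* Sketch only: accept e.g. "262144", "256K", or "1M". */
    static size_t arena_size_from_env(size_t default_size)
    {
        const char *s = getenv("PYTHONARENASIZE");   /* proposed variable */
        if (s == NULL || *s == '\0')
            return default_size;
        char *end;
        size_t n = (size_t)strtoull(s, &end, 10);
        if (end == s)
            return default_size;                     /* not a number */
        switch (*end) {
            case 'K': case 'k': n *= 1024; break;
            case 'M': case 'm': n *= 1024 * 1024; break;
            default: break;                          /* plain byte count */
        }
        return n ? n : default_size;
    }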
I would like to understand how private free lists are "so much" faster. In fact, I don't recall whether anyone ever *measured* the performance speedup of these free lists :-)
The issue [1] is still open. The patches were neither applied nor rejected.
They show a speedup in microbenchmarks, but it is not large: up to 40%
for iterating over enumerate() and 5-7% for integer-heavy computations
like base85 encoding or the spectral_norm benchmark.
[1] https://bugs.python.org/issue25324
Another possible strategy: allocate several arenas at once (using a larger mmap() call), and use MADV_DONTNEED to relinquish individual arenas.
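Here is a minimal sketch of that strategy, assuming Linux (the slab size and helper names are illustrative, not a design):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve a "multi-arena" slab with a single mmap() and give the
     * pages of individual arenas back with MADV_DONTNEED.  The address
     * range stays reserved; the physical pages go back to the OS and
     * read as zeroes if touched again. */
    #define ARENA_SIZE      (256 * 1024)   /* current pymalloc arena size */
    #define ARENAS_PER_SLAB 64             /* 16 MB per mmap() call (assumed) */

    static char *slab_alloc(void)
    {
        void *p = mmap(NULL, (size_t)ARENA_SIZE * ARENAS_PER_SLAB,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : (char *)p;
    }

    static void arena_release(char *slab, int i)
    {
        /* Relinquish arena i without unmapping the slab. */
        madvise(slab + (size_t)i * ARENA_SIZE, ARENA_SIZE, MADV_DONTNEED);
    }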
block -> pool -> arena -> "multi-arena" -> OS

Y'know, this might actually make things faster. These multi-arenas could be the dynamically growing thing Victor wants to try. We allocate 16 MB, then carve it up into arenas (however big those are), then next time allocate 32 MB or what have you. Since the arenas remain a fixed size, we don't make the frequently-used code path (address_in_range) any slower. The code to deal with the multi-arenas would add a little complexity--to an admittedly already complex allocator implementation, but then what allocator isn't complex internally?--but it'd be an infrequent code path and I bet it'd be an improvement over simply calling malloc / mmap / VirtualAlloc. What do you think, Victor?
I would be curious about another test: use pymalloc for objects larger than 512 bytes. For example, allocate up to 4 kB? In the past, we already changed the maximum size from 256 to 512 bytes to support the most common Python objects on 64-bit platforms. Since Python objects contain many pointers, switching from 32-bit to 64-bit can double the size of an object in the worst case.
I should note that Py_ADDRESS_IN_RANGE is my code - this isn't a backhanded swipe at someone else.
I hope those are not the actual numbers you're intending to use ;-) I still think that allocating more than 1 or 2 MB at once would be foolish. Remember this is data that's going to be carved up into (tens of) thousands of small objects. Large objects eschew the small object allocator (not to mention that third-party libraries like NumPy may be using different allocation routines when they allocate very large data).
[Larry Hastings <la...@hastings.org>]
> ... Yet CPython's memory consumption continues to grow. By the time a current "trunk" build of CPython reaches the REPL prompt it's already allocated 16 arenas.
I'd be surprised if that's true ;-) The first time `new_arena()` is called, it allocates space for a vector of 16 (INITIAL_ARENA_OBJECTS) `arena_object` structs. Those are tiny, and hold bookkeeping info for the actual arenas, none of which are allocated at first.
So at most 9 arenas ("highwater mark") were ever simultaneously allocated.
I was hoping to spur a discussion of much higher-level issues. I bet Larry was too ;-)
[Tim]
>> So at most 9 arenas ("highwater mark") were ever simultaneously allocated [by the
>> time the REPL prompt appeared in a 64-bit 3.6.1].
> ... though not completely off-base.
Yes, 9 is in the ballpark of 16.
I think _some_ increase of arena size should be a no-brainer, but I don't expect it to help a lot.