Hi Jean-Michel,
On Mon, 12 Apr 2021 at 01:56, jean-mich...@csiro.au
<jean-mich...@csiro.au> wrote:
> Thank you (belatedly) Armin for pointing this out. I am not a PyPy user but will see if I can understand its GC behavior, if I get this right. A leak would be a no-no of course; delayed disposal would not shock me, if similar in behavior to .NET, R, etc.
This is a class of problems that isn't really a leak, but has all the
appearances of one. The problem is that a GC like PyPy doesn't know
that each of the small proxy objects holds a potentially large amount
of memory behind the scenes. So PyPy won't bother running its own GC
very often. For example, if your program holds 100 times more
memory in behind-the-scenes resources than in the Python objects
themselves (which can easily occur in some situations), then the
memory usage can grow 100 times above normal. This is also why it
seems, on CPython, that you don't have to close files explicitly,
but on PyPy you do---or you run into EMFILE as soon as your GC heap
grows a bit larger and the GC collects less often. This is not
specific to PyPy; all non-reference-counting GCs suffer from it
(including Java's and .NET's).
One solution is to design the APIs so that they require explicit
finalization, with proxy objects that carry an explicit reference
count and explicit "incref" and "decref" methods. For some cases,
you can add a syntactic layer on top (e.g. context managers).
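A minimal sketch of that idea (the ``NativeBuffer`` class and its
method names are made up here for illustration, not an existing API):

```python
class NativeBuffer:
    """Proxy around a (simulated) large out-of-heap resource, with an
    explicit reference count instead of relying on the GC."""

    def __init__(self, size):
        self._size = size
        self._refcount = 1          # the creator holds one reference
        self._released = False

    def incref(self):
        if self._released:
            raise ValueError("buffer already released")
        self._refcount += 1

    def decref(self):
        self._refcount -= 1
        if self._refcount == 0:
            self._release_native()

    def _release_native(self):
        # In real code this would free the C-level memory immediately,
        # without waiting for the GC to notice the proxy is dead.
        self._released = True

    # The syntactic layer on top: a context manager that pairs one
    # incref/decref automatically.
    def __enter__(self):
        self.incref()
        return self

    def __exit__(self, *exc):
        self.decref()
        return False


buf = NativeBuffer(10 * 1024 * 1024)
with buf:                 # refcount goes 1 -> 2 -> back to 1
    pass
buf.decref()              # drop the creator's reference: frees now
```

The point is that the native memory is released at a deterministic
moment, on any GC, instead of "whenever the collector runs next".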
Another solution is to use some semi-internal API to tell the GC that
there are more resources behind its managed objects. In PyPy it is
``__pypy__.add_memory_pressure()``. See also
https://docs.microsoft.com/en-us/dotnet/api/system.gc.addmemorypressure
for the motivation from C#.
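As a rough sketch (``ImageProxy`` is invented for illustration; only
``__pypy__.add_memory_pressure()`` is the real PyPy hook, and it only
exists on PyPy, so the code falls back to a no-op elsewhere):

```python
try:
    # PyPy only: tell the GC about memory it cannot see.
    from __pypy__ import add_memory_pressure
except ImportError:
    def add_memory_pressure(estimate):
        # CPython: reference counting frees promptly anyway.
        pass


class ImageProxy:
    """Tiny Python object standing in for a large native allocation."""

    def __init__(self, nbytes):
        # The GC only sees this small proxy; reporting the real cost
        # makes it collect often enough to release the native memory.
        self._nbytes = nbytes
        add_memory_pressure(nbytes)


proxy = ImageProxy(50 * 1024 * 1024)   # GC now knows ~50 MB hides behind it
```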
A bientôt,
Armin.