How to proceed to reduce Sage's memory leaks?


Nils Bruin

Nov 3, 2012, 3:58:08 PM
to sage-devel
Presently, Sage has a significant memory leak issue: uniqueness of
parents is currently guaranteed by keeping them in memory
permanently. This prevents many computational strategies that are
otherwise perfectly legitimate but require the construction of, for
instance, many finite fields and/or polynomial rings. A lot of
arithmetic-geometric constructions fall into that category, and current
approaches to tackling noncommutative algebra do too. Every single time
I have tried to use Sage for some significant computation, this has
prevented me from completing it.

There has been work on resolving this issue by replacing permanent
storage with weak caching, so that parents can actually get deleted; see
tickets #715 and #11521, for instance. The code on these tickets is by
now one of the most carefully reviewed (future) parts of Sage.
However, time and again, new issues crop up because there is broken
code elsewhere that never got exercised, since parents were never
deleted even when they should have been.

We have been in good shape a couple of times now, with all noticeable
issues resolved. However, the merger of *other* tickets brought yet
more issues to light, resulting in #715 and #11521 being pulled.

If we ever want Sage to be usable on a reasonable scale, parents need
to be deleted every now and again. The basic design allows them to be.
It's just that there is a lot of code in Sage that breaks when that
actually happens. Apparently, the normal ticket review and merger
process is not suitable for a change this fundamental to Sage's
infrastructure, because it favours small-scale and superficial patches
(and hence keeps moving the goalposts for infrastructure changes).
Any ideas on how to get this done?
For me this is a must-have for considering Sage a viable platform, and
I suspect I am not the only one for whom it is.

Cheers,

Nils

Jeroen Demeyer

Nov 3, 2012, 4:12:54 PM
to sage-...@googlegroups.com
Let me add that the bugs revealed by these tickets are often
quite complex. They are hard to debug, both for Nils Bruin and Simon
King working on the ticket, and for me as release manager.

For example, I remember two seemingly unrelated tickets in the past
which caused a bug together, but not independently.

Travis Scrimshaw

Nov 3, 2012, 5:58:10 PM
to sage-...@googlegroups.com
Here are my thoughts on the matter, but I'm not an expert on the inner workings of Sage, so please forgive/tell me if this is already done/impossible.

I propose limiting the size of the cache of parents and keeping track of the number of references to each parent. Parents with the fewest references would then be evicted from the cache once we reach the maximum. Additionally, if a parent has no references, we allow the garbage collector to take it. To get around referenced parents being spontaneously deleted, every time we return a parent object we actually return a lightweight bridge object, which recreates the parent on demand if it has been deleted (and which also gets notified when the parent is deleted). Something like this:

class ParentBridge:
    def __init__(self, parent_class, data):
        self._parent_class = parent_class
        self._data = data  # arguments originally passed to the parent
        self._parent = None

    def parent(self):
        if self._parent is None:
            self._create_parent()
        return self._parent

    def _create_parent(self):
        # Recreate the parent (in real code, through the parent cache,
        # so that uniqueness is preserved).
        self._parent = self._parent_class(*self._data)

    def _parent_deleted(self):
        # Callback invoked when the cached parent is collected.
        self._parent = None
We also return the same ParentBridge when the parent is stored in the cache. This would basically be a slight modification of a weak reference that recreates the target object if it is invalid. Another variant is to implement some other type of cache replacement algorithm (http://en.wikipedia.org/wiki/Cache_algorithms).

Alternatively, we could just allow parents with no references to be garbage collected. This would likely not break any doctests, since checks of parent identity usually happen on successive lines, and the garbage collector usually does not have time to collect anything within a few lines when doctesting. We might also want to add a flag for (very) select instances saying that they can never be collected. A sketch of such a weak cache follows.
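
[Editorial note: a minimal sketch of that idea, using a plain weak-value cache; get_parent and the key scheme here are illustrative, not Sage's actual factory machinery.]

import weakref

# Values are held weakly: once nothing else references a parent, the
# garbage collector may reclaim it and its cache entry disappears.
_parent_cache = weakref.WeakValueDictionary()

def get_parent(parent_class, *args):
    key = (parent_class, args)
    try:
        return _parent_cache[key]
    except KeyError:
        P = parent_class(*args)
        _parent_cache[key] = P
        return P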

In both of the above, there is at most one instance of a given parent at any one time, so I do not foresee any problems (as long as we can reconstruct the parent object and the appropriate references if it is deleted). Nevertheless, however we implement this, it must change the interface only minimally, and I suspect the first approach I suggested may require substantial changes...

Best,
Travis

Volker Braun

Nov 3, 2012, 7:18:11 PM
to sage-...@googlegroups.com
I'd say talk to Jeroen about making collectable parents a priority for one release. For example, let's have 5.5 be the release where we add the collectable parents. Push out a beta1 with these patches; then we'll have a month during Jeroen's holiday where we can check any other tickets. No other tickets get merged if they break the parents stuff.




Jeroen Demeyer

Nov 3, 2012, 7:41:56 PM
to sage-...@googlegroups.com
An extra complication is that the breakage is often non-reproducible and
system-dependent. Together with the weird interactions between seemingly
unrelated patches, even determining whether a patch breaks the parent
stuff is highly non-trivial.

Volker Braun

Nov 3, 2012, 8:06:42 PM
to sage-...@googlegroups.com
You make it sound like there is just not enough doctest coverage. The Sage doctests generally do not generate a lot of parents in one go. Maybe it's just that the coverage of this use case needs to be improved? E.g. create a list of thousands of parents, delete a random subset, garbage collect, repeat? A sketch of such a stress test follows.
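
[Editorial note: a minimal sketch of such a stress loop, assuming the Sage globals PolynomialRing and QQ; the choice of parents is just for illustration.]

import gc
import random

def stress_parents(n=1000, rounds=5, seed=0):
    """Create many parents, drop random subsets, and force collection."""
    rng = random.Random(seed)
    parents = [PolynomialRing(QQ, 'x%s' % i) for i in range(n)]
    for _ in range(rounds):
        # Keep a random half of the surviving parents; the rest become garbage.
        parents = [P for P in parents if rng.random() < 0.5]
        gc.collect()
        # Exercise the survivors so that dangling internals would surface.
        for P in parents:
            assert P.gen() in P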

I admit that I haven't followed these patches as closely as I would have liked. It's clear that deleting parents can trigger lots of nasty stuff. We need to understand how to exercise that code.

If we can agree to dedicate a point release to this issue, then that just means that beta0 is going to be broken on some systems. I take it this is Nils' original objection: not every beta has to work perfectly on every system. If you merge a hundred small patches, then it's reasonable to kick back anything that triggers a doctest failure. But if you want to make progress on a big issue, then you have to accept that a beta is going to be imperfect and is meant to expose a ticket to a much wider audience.

Francois Bissey

Nov 3, 2012, 8:23:57 PM
to sage-...@googlegroups.com
On 04/11/12 13:06, Volker Braun wrote:
> But if you want to make progress on a big issue, then you have to accept
> that a beta is going to be imperfect and is meant to expose a ticket to a
> much wider audience.

Actually, because some of the bugs are platform-dependent etc., the
audience from a beta may not be big enough.
But nevertheless, we just have to bite the bullet, do the best we can,
and fix things as they become apparent. We cannot stop moving forward
forever because we are afraid of accidentally breaking stuff.

Francois

Jeroen Demeyer

Nov 4, 2012, 3:29:16 AM
to sage-...@googlegroups.com
On 2012-11-04 01:06, Volker Braun wrote:
> You make it sound like there is just not enough doctest coverage. The
> Sage doctests generally do not generate a lot of parents in one go.
> Maybe it's just that the coverage of this use case needs to be improved?
> E.g. create a list of thousands of parents, delete a random subset,
> garbage collect, repeat?
It would be absolutely awesome if we had good doctests for this.
Of all the tickets I have ever seen as release manager, this is
probably the single hardest one to debug and figure out why stuff
breaks (with #12221 as an honorable second).

Jeroen Demeyer

Nov 4, 2012, 3:36:48 AM
to sage-...@googlegroups.com
On 2012-11-04 01:23, Francois Bissey wrote:
> But nevertheless, we just have to bite the bullet, do the best we can,
> and fix things as they become apparent. We cannot stop moving forward
> forever because we are afraid of accidentally breaking stuff.
OK, let's go for it!

Do you also want other tickets like #12215 and #12313, or should we do
just #715 + #11521?

Francois Bissey

Nov 4, 2012, 4:14:41 AM
to sage-...@googlegroups.com
It may be best to do only one set of big changes at a time, just so as
not to confuse issues. But these two sets may be similar enough.
Any other opinions?

Francois

Robert Bradshaw

Nov 5, 2012, 3:12:02 PM
to sage-...@googlegroups.com
+1. I've been meaning to get back to this for ages but just haven't
found the time. If we're going to make a big push to get this in, I'll
do what I can to help.

For testing, I would propose we manually insert gc operations
periodically to see if we can reproduce the failures more frequently.
We could then mark some (hopefully a very small number of) parents as
"unsafe to garbage collect" and go forward with this patch, holding
hard references to all "unsafe" parents so we can look into them later
(which isn't a regression).

- Robert

Simon King

Nov 5, 2012, 6:25:20 PM
to sage-...@googlegroups.com
Hi Robert,

On 2012-11-05, Robert Bradshaw <robe...@gmail.com> wrote:
> +1. I've been meaning to get back to this for ages but just haven't
> found the time. If we're going to make a big push to get this in, I'll
> do what I can to help.

I'd appreciate your support!

> For testing, I would propose we manually insert gc operations
> periodically to see if we can reproduce the failures more frequently.

How can one insert gc operations? You mean by inserting gc.collect()
into doctests, or by manipulating the Python call hook?

> We could then mark some (hopefully a very small number of) parents as
> "unsafe to garbage collect" and go forward with this patch, holding
> hard references to all "unsafe" parents so we can look into them later
> (which isn't a regression).

That is actually what we tried: there was a bug that occurred only
on bsd.math, and it could be fixed by keeping a strong cache for
polynomial rings (which is unacceptable for my own project, but which
is at least no regression).

Anyway, I have not looked into the new problems yet. If it is (again)
about libsingular polynomial rings, then I think we should really make
an effort to get reference counting for libsingular rings right.

Best regards,
Simon

Robert Bradshaw

Nov 5, 2012, 8:15:07 PM
to sage-...@googlegroups.com
On Mon, Nov 5, 2012 at 3:25 PM, Simon King <simon...@uni-jena.de> wrote:
> How can one insert gc operations? You mean by inserting gc.collect()
> into doctests, or by manipulating the Python call hook?

I was thinking about inserting it into the doctesting code, e.g. with
a random (known seed) x% chance between any two statements.
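
[Editorial note: a minimal sketch of that, assuming a hook called between doctest statements; maybe_collect is an illustrative name, not part of Sage's actual doctest framework.]

import gc
import random

# Fixed seed so that a failing run can be replayed exactly.
_rng = random.Random(0)

def maybe_collect(probability=0.05):
    """Force a full garbage collection with the given probability."""
    if _rng.random() < probability:
        gc.collect()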

> Anyway, I have not looked into the new problems yet. If it is (again)
> about libsingular polynomial rings, then I think we should really make
> an effort to get reference counting for libsingular rings right.

True, but I'd rather not let any particular ring hold us back from
getting the general fix in.

- Robert

Jeroen Demeyer

Nov 12, 2012, 4:47:16 PM
to sage-...@googlegroups.com
Bad news again. During a preliminary test of sage-5.5.beta2, I again got
a segmentation fault in
devel/sage/sage/schemes/elliptic_curves/ell_number_field.py,
but this time on a different system (arando: Linux i686) and with a
different set of patches than before. And for added fun: this time the
error isn't always reproducible.

Nils Bruin

Nov 12, 2012, 10:16:15 PM
to sage-devel
On Nov 12, 1:47 pm, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
> And for added fun: this time the error isn't always reproducible.

That's excellent news! Just keep trying until it's not reproducible
anymore. Then we're fine!

Seriously though, the fact that the bug pops up in the same file as
before indicates that the deletion of a similar kind of object is
probably to blame here. We just need to keep trying until we find a
way to consistently produce the error on a platform with reasonable
debugging tools.

Incidentally: are PPC-OSX4 (or wherever the problem arose earlier)
and i686 both 32-bit platforms? My bet is Singular, since we know the
refcounting there (or at least our interfacing with it) is handled
fishily, and a previous issue indicated that omalloc is almost tailor-
made to generate different problems on different word lengths.

Michael Welsh

Nov 12, 2012, 10:17:57 PM
to sage-...@googlegroups.com
On 13/11/2012, at 4:16 PM, Nils Bruin <nbr...@sfu.ca> wrote:
>
> Incidentally: are PPC-OSX4 (or wherever the problem arose earlier)
> and i686 both 32-bit platforms?

Yes.

Jean-Pierre Flori

Nov 13, 2012, 8:13:04 PM
to sage-...@googlegroups.com
I'll try to set up a 32-bit (i686) install of the latest beta this weekend and give this a shot.
If I'm lucky enough, I'll be able to reproduce the problem and get a proper backtrace, hopefully pointing to libsingular.

Jeroen Demeyer

Nov 14, 2012, 11:29:48 AM
to sage-...@googlegroups.com
It also happens on other systems, including 64-bit ones. It's easy to
reproduce on the Skynet machine "sextus" (Linux x86_64), where it
happens about 71% of the time.

Nils Bruin

Nov 14, 2012, 1:06:53 PM
to sage-devel
On Nov 14, 8:29 am, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
> It also happens on other systems, including 64-bit ones. It's easy to
> reproduce on the Skynet machine "sextus" (Linux x86_64), where it
> happens about 71% of the time.

That might be workable. Which exact version/patches reproduce the
problem? (I don't think I have a login on "sextus".) I don't promise
that I'll actually have time to build/test/track down this problem,
but I'll see. Other people should definitely look at it too.

Jeroen Demeyer

Nov 14, 2012, 2:28:16 PM
to sage-...@googlegroups.com
On 2012-11-14 19:06, Nils Bruin wrote:
> I don't think I have a login on "sextus"
FYI: it's a Fedora 16 system with an Intel(R) Pentium(R) 4 CPU 3.60GHz
processor running Linux 3.3.7-1.fc16.x86_64.

Nils Bruin

Nov 14, 2012, 5:34:27 PM
to sage-devel
On Nov 14, 11:28 am, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
> FYI: it's a Fedora 16 system with an Intel(R) Pentium(R) 4 CPU 3.60GHz
> processor running Linux 3.3.7-1.fc16.x86_64.

That sounded convenient, because my desktop is similar:

Fedora 16 running 3.6.5-2.fc16.x86_64 #1 SMP on an Intel(R) Core(TM)
i7-2600 CPU @ 3.40GHz

No such luck, however. With

$ ./sage -v
Sage Version 5.5.beta2, Release Date: 2012-11-13

I ran

for i in `seq 100`; do
    echo $i;
    ./sage -t devel/sage/sage/schemes/elliptic_curves/ell_number_field.py || echo FAULT AT i is $i;
done

which succeeded all 100 times.

Nils Bruin

Nov 14, 2012, 6:42:23 PM
to sage-devel
However, in an effort to make memory errors during testing a little
more reproducible, I made the little edit below to local/bin/sagedoctest.py
to ensure the garbage collector is run before every doctested line:

--------------------------------------------------------------------
diff --git a/sagedoctest.py b/sagedoctest.py
--- a/sagedoctest.py
+++ b/sagedoctest.py
@@ -1,7 +1,9 @@
 from __future__ import with_statement
 
 import ncadoctest
+import gc
 import sage.misc.randstate as randstate
+import sys
 
 OrigDocTestRunner = ncadoctest.DocTestRunner
 class SageDocTestRunner(OrigDocTestRunner):
@@ -35,6 +37,8 @@ class SageDocTestRunner(OrigDocTestRunne
         except Exception, e:
             self._timeit_stats[key] = e
         # otherwise, just run the example
+        sys.stderr.write('testing example %s\n'%example)
+        gc.collect()
         OrigDocTestRunner.run_one_example(self, test, example,
             filename, compileflags)
 
     def save_timeit_stats_to_file_named(self, output_filename):
--------------------------------------------------------------------

(i.e., just add a gc.collect() to run_one_example)

and it causes a reliable failure in crypto/mq/mpolynomialsystem.py:

Trying:
    C[Integer(0)].groebner_basis()###line 84:_sage_ sage: C[0].groebner_basis()
Expecting:
    Polynomial Sequence with 26 Polynomials in 16 Variables
testing example <ncadoctest.Example instance at 0x69706c8>
ok
Trying:
    A,v = mq.MPolynomialSystem(r2).coefficient_matrix()###line 87:_sage_ sage: A,v = mq.MPolynomialSystem(r2).coefficient_matrix()
Expecting nothing
testing example <ncadoctest.Example instance at 0x6970710>
*** glibc detected *** python: double free or corruption (out): 0x00000000075c58c0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x31cfe7da76]
/lib64/libc.so.6[0x31cfe7ed5e]
/usr/local/sage/5.5b2/local/lib/python/site-packages/sage/rings/polynomial/pbori.so(+0x880aa)[0x7fa5eba7e0aa]
/usr/local/sage/5.5b2/local/lib/python/site-packages/sage/rings/polynomial/pbori.so(+0x1d993)[0x7fa5eba13993]
...

Running it under sage -t --gdb gives:

(gdb) bt
#0  0x00000031cfe36285 in raise () from /lib64/libc.so.6
#1  0x00000031cfe37b9b in abort () from /lib64/libc.so.6
#2  0x00000031cfe7774e in __libc_message () from /lib64/libc.so.6
#3  0x00000031cfe7da76 in malloc_printerr () from /lib64/libc.so.6
#4  0x00000031cfe7ed5e in _int_free () from /lib64/libc.so.6
#5  0x00007fffce5cb0aa in Delete<polybori::groebner::ReductionStrategy> (mem=0x547db30)
    at /usr/local/sage/5.5b2/local/include/csage/ccobject.h:77
#6  __pyx_pf_4sage_5rings_10polynomial_5pbori_17ReductionStrategy_2__dealloc__ (__pyx_v_self=<optimized out>)
    at sage/rings/polynomial/pbori.cpp:37868
#7  __pyx_pw_4sage_5rings_10polynomial_5pbori_17ReductionStrategy_3__dealloc__ (__pyx_v_self=0x54bf390)
    at sage/rings/polynomial/pbori.cpp:37834
#8  __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy (o=0x54bf390)
    at sage/rings/polynomial/pbori.cpp:52283
#9  0x00007fffce560993 in __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy (o=0x54baeb0)
    at sage/rings/polynomial/pbori.cpp:52545
#10 0x00007ffff7d4b637 in delete_garbage (old=0x7ffff7fe19e0, collectable=0x7fffffffbb60) at Modules/gcmodule.c:769
#11 collect (generation=2) at Modules/gcmodule.c:930
#12 0x00007ffff7d4bdc9 in gc_collect (self=<optimized out>, args=<optimized out>, kws=<optimized out>) at Modules/gcmodule.c:1067

which should give a pretty good pointer for the PolyBoRi people to figure
out which memory deallocation is actually botched.

Nils Bruin

Nov 14, 2012, 7:15:34 PM
to sage-devel
<polybori problem>: this is actually reproducible in plain 5.0. It is now tracked at

http://trac.sagemath.org/sage_trac/ticket/13710

Nils Bruin

Nov 14, 2012, 7:22:24 PM
to sage-devel
Other consequences of the gc.collect() insertions:

sage -t -force_lib devel/sage/sage/crypto/mq/mpolynomialsystem.py # Killed/crashed
sage -t -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_sequence.py # Killed/crashed

(same problem; reported as above)


**********************************************************************
File "/usr/local/sage/5.5b2/devel/sage/sage/modular/abvar/
abvar_ambient_jacobian.py", line 345:
sage: J0(33).decomposition(simple=False)
Expected:
[
Abelian subvariety of dimension 2 of J0(33),
Simple abelian subvariety 33a(None,33) of dimension 1 of J0(33)
]
Got:
[
Abelian subvariety of dimension 2 of J0(33),
Abelian subvariety of dimension 1 of J0(33)
]
**********************************************************************

sage -t -force_lib devel/sage/sage/modular/abvar/abvar_ambient_jacobian.py # 1 doctests failed

(i.e., the doctest relies on a previous copy of 33a remaining in
memory, on which additional computations have changed the way it
prints. That's a violation of immutability anyway, and the doctest
shouldn't rely on such behaviour.)


**********************************************************************
File "/usr/local/sage/5.5b2/devel/sage/sage/modular/abvar/abvar.py",
line 2840:
sage: J0(33).is_simple(none_if_not_known=True)
Expected:
False
Got nothing
**********************************************************************
sage -t -force_lib devel/sage/sage/modular/abvar/abvar.py # 1 doctests failed

Same problem! Since J0(33) is freshly constructed, one should not rely
on anything being cached on it, and the test explicitly asks not to
compute anything.

Jean-Pierre Flori

Nov 14, 2012, 9:58:45 PM
to sage-...@googlegroups.com
We dealt with something very similar in one of the "memleaks" tickets.
I'm not sure whether it was #715 or #11521; maybe it was #12313 (the numbers here might be wrong...).
So the fix is potentially not included in 5.5.beta2, if it was in the latter.

Jean-Pierre Flori

Nov 14, 2012, 10:00:15 PM
to sage-...@googlegroups.com
OK, I took the time to check: you actually posted on #13710 that the fix is included in #12313, so it is not in 5.5.beta2 if I'm not mistaken (nor in 5.0, of course).

Jeroen Demeyer

Nov 16, 2012, 2:59:02 AM
to sage-...@googlegroups.com
On 2012-11-14 23:34, Nils Bruin wrote:
> On Nov 14, 11:28 am, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
>> FYI: it's a Fedora 16 system with an Intel(R) Pentium(R) 4 CPU 3.60GHz
>> processor running Linux 3.3.7-1.fc16.x86_64.
>
> That sounded convenient because my desktop is similar:
>
> Fedora 16 running 3.6.5-2.fc16.x86_64 #1 SMP on Intel(R) Core(TM)
> i7-2600 CPU @ 3.40GHz

Could you try again with sage-5.5.beta1?

Nils Bruin

Nov 16, 2012, 1:35:52 PM
to sage-devel
On Nov 15, 11:59 pm, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
> Could you try again with sage-5.5.beta1?

Same behaviour. Was there a reason to expect differently?
I guess something is different on sextus. Bad memory or other hardware
problems?

I was surprised by how few issues arose from inserting garbage
collections between all doctests. That should upset the memory usage
patterns so much that I would expect it to shake out many problems.
Only things like Singular's omalloc would be immune, because it hides
alloc/dealloc operations from the OS: you really need to wait for an
actual corruption to see a problem. The guarded-malloc experiment on
OSX and similar operations took care of that. See

http://trac.sagemath.org/sage_trac/ticket/13447

for a dirty Singular package that swaps out omalloc for a system
malloc, which then allows normal OS tools to check memory allocation,
access, and deallocation. See also the ticket for notes on how the approach
taken there can be adapted to let Singular use the system malloc under
Linux (one Singular malloc routine needs to know the size of an
allocated block, which is a non-POSIX malloc feature that both OSX and
Linux support, in different ways).

Do we have other memory managers in Sage that play tricks like
omalloc? Things run a lot slower when you switch them back to the system
malloc, but it does enable conventional memory-sanitation tests.

Valgrind produces far too many warnings to be useful. All you want is
a segfault on any access-after-dealloc or double-dealloc (out-of-bounds
access would be nice too). OSX's libgmalloc is perfect for
that. Is there a Linux equivalent (or a way to configure valgrind to
do just this)?
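
[Editorial note: for concreteness, this is roughly how the two options are invoked, assuming the environment variables propagate through the sage script to the actual Python process; the test file is just an example.]

# Linux/glibc: report and abort as soon as heap corruption is detected
MALLOC_CHECK_=3 ./sage -t devel/sage/sage/schemes/elliptic_curves/ell_number_field.py

# OS X: libgmalloc puts each allocation on its own page and unmaps freed
# pages, so an access-after-free segfaults on the spot
DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib ./sage -t devel/sage/sage/schemes/elliptic_curves/ell_number_field.py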

I pose it as a challenge that no one is able to do comprehensive
testing of memory alloc/dealloc in Sage. Even though I have outlined
above the exact approach that would make it a relatively straightforward
process, no one has the stamina and heroic hacker skills to pull it
off. Prove me wrong!

Jeroen Demeyer

Nov 17, 2012, 4:01:22 AM
to sage-...@googlegroups.com
On 2012-11-16 19:35, Nils Bruin wrote:
> On Nov 15, 11:59 pm, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
>> Could you try again with sage-5.5.beta1?
>
> Same behaviour. Was there a reason to expect differently?
After adding every single ticket, there is reason to expect differently.
This stuff is *so sensitive* to changes, even changes that look
completely unrelated.

For example, at first sight, the errors are gone again in sage-5.5.beta2.

Nils Bruin

Nov 17, 2012, 2:01:19 PM
to sage-devel
On Nov 17, 1:01 am, Jeroen Demeyer <jdeme...@cage.ugent.be> wrote:
> After adding every single ticket, there is reason to expect differently.
> This stuff is *so sensitive* to changes, even changes that look
> completely unrelated.
That's why the effort to do strict checking of memory management
should help (and it was in that light that I interpreted your
request). I think the sensitivity comes from the fact that you have to
wait for the coincidence that a freed-too-early location gets reused
and is *then* written to in its own role (i.e., actual corruption).

Calling gc.collect() all the time should make deletions a little more
predictable, and a very strict malloc/free should detect the problem
sooner. I'm afraid that MALLOC_CHECK_ isn't as good as BSD's gmalloc,
where even an access-after-free is a segfault (and many out-of-bounds
accesses are too).

Once one gets a little better at writing valgrind suppressions, it's
easy to make valgrind produce less irrelevant output, so perhaps
there's a future for that. Or perhaps a tool to query and sort
valgrind reports after the fact (basically, filtering after the fact).
Perhaps it's time for William to hire someone again who is really good
at this stuff, because mathematically it's utterly uninteresting work
(and it really is finding and cleaning up other people's mess).

Ivan Andrus

Nov 17, 2012, 3:20:01 PM
to sage-...@googlegroups.com
At one point I had the goal of creating a suppressions file so that the doctests passed "cleanly". I'm sure some of the suppressions were actual problems, but it would at least allow you to find new problems. I still have the scripts that I used to collect and remove duplicate suppressions, and I would be happy to run them again if people think it would be useful. Sadly, my machine isn't the fastest, so it takes quite a while (running all the doctests under valgrind is _slow_), and I never did make it all the way through the test suite. But especially if I knew the likely areas, it wouldn't be too hard to run some tests overnight and see what turns up.
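
[Editorial note: for readers unfamiliar with the format, a suppression entry looks roughly like the hypothetical one below; the name and object pattern are made up, and real entries are best generated with valgrind's --gen-suppressions=all option and then pruned by hand.]

{
   hypothetical_libpari_noise
   Memcheck:Cond
   obj:*/libpari*.so*
}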

-Ivan

Nils Bruin

Nov 17, 2012, 3:56:14 PM
to sage-devel
On Nov 17, 12:20 pm, Ivan Andrus <darthand...@gmail.com> wrote:

> At one point I had the goal of creating a suppressions file so that the doctests passed "cleanly". [...] But especially if I knew the likely areas, it wouldn't be too hard to run some tests overnight and see what turns up.

Anything that has to do with libsingular. The problem is that OTHER
tests may well exercise this code much better than libsingular's own
doctests do.

However, with an unmodified libsingular it's unlikely you'll find
anything. omalloc allocates pages of system memory and then manages
pieces of them by itself, so as far as valgrind is concerned there is
relatively little allocation/deallocation activity. I think you can go
further and tell valgrind about the workings of alternative memory
managers; that would improve the diagnostics a little. But if the compact
memory layout of omalloc (compactness is its purpose) isn't
changed, you still have a good chance that an access-after-free refers
to perfectly valid memory (a block that has since been reallocated for a
different purpose).

This is the issue I'm trying to address with the malloc version of
Singular. Combining that with a malloc implementation that puts blocks on
separate pages, at the edge of the page, unmaps any page upon
deallocation, and tries to avoid reusing pages or using adjacent logical
pages means that any illegal access is almost sure to segfault. BSD's
gmalloc does that. It seems glibc's malloc with MALLOC_CHECK_=2 or 3
does at least a bit of it.

The real problem here is that we (Simon, Volker, and I) don't know for
sure what the refcount and deletion protocols for Singular objects
are. It seems to be the kind of thing that is folklore inside the
Singular group but was never properly documented. Singular was not
designed to be a clean library, although that does seem to be a direction
Singular is heading, so perhaps this will get documented
properly at some point. I just think Sage can't wait for the decade or so
that this is probably going to take.

Ivan Andrus

Nov 17, 2012, 4:42:07 PM
to sage-...@googlegroups.com
Thanks for the explanation. That makes sense. It sounds like there's not much valgrind will help with, but I'll give it a go anyway.

-Ivan
