Hi all,
If anyone is up for a good challenge (and the reason why I got a bit behind on working on my Enzo-E PRs...):
I've been trying to track down a possible memory leak for the past week or so (in Enzo) but haven't really been able to make any solid progress on definitively diagnosing the issue. This is in large part because I can only reproduce the seg-fault (which is either a "free(): invalid pointer:" or "double free or corruption") on a large run (256^3 with 9 levels and with radiative transfer), and it only occurs 1) when running with optimization O1 or O2 (though a problem may still exist in O0), and 2) when running on > 2 processors.
I've spent a very long time using both valgrind and address sanitizer to try and get to the bottom of the issue on smaller problems, but with no luck. This is partly because I'm pretty unfamiliar with the routines that seem to be possibly causing the problem. I was wondering if anyone has experience with the below bits of code to say if there is indeed an issue or not:
1) I can get a core dump when the seg-fault occurs. Most of the core files contain no backtrace (just something like "<class 'gdb.MemoryError'> Cannot access memory at address 0x7ffe2ca17718:"), but usually one has a backtrace that points to line 495 in Grid_DepositParticlePositions.C (`delete [] ParticleMassTemp;`). Given this is a memory error, this isn't necessarily the problem point; it is just the point where whatever is going wrong pops up. It looks like this file has been only minimally changed since 2016. I tried initializing pointers to NULL and doing NULL checks before the delete, but this does not change the behavior.
2) Running valgrind locally on a separate problem, I get an explosion of weird memory errors stemming from CommunicationTransferPhotons.C that look something like this. It looks like memory gets corrupted all over the place, giving "Conditional jump or move depends on uninitialised value(s)" in several very unrelated routines (there is nothing special about the grid::ComputePressure in that example... it pops up in many different places). As you can see, Valgrind points to line 162 in CommunicationTransferPhotons.C, where a photon send list is allocated (`SendList[proc] = new GroupPhotonList[nPhoton[proc]];`). But again, not much has changed in this routine lately and I can't see any issues in this code, so I'm not sure whether this is actually the problem point.
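For reference on point 2: when Valgrind names a `new[]` line as the origin of an uninitialised value (as it does with `--track-origins=yes`), that usually means some memory allocated there was read before it was written, not that the allocation itself is wrong. A plain `new T[n]` leaves the members of a simple struct indeterminate, whereas `new T[n]()` zeroes them. The sketch below shows one way to test whether every slot of such a list actually gets filled. `PhotonRecord`, `alloc_zeroed` and `count_unfilled` are hypothetical stand-ins for illustration, not the real `GroupPhotonList` or any Enzo routine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a photon send record; NOT the real GroupPhotonList.
struct PhotonRecord {
    double energy;
    int    to_grid;
    char   type;
};

// "new T[n]" leaves the members of a plain struct indeterminate, so any
// element that is never filled in is uninitialised when read later.
// "new T[n]()" value-initializes, which zeroes every member.
PhotonRecord* alloc_zeroed(std::size_t n) {
    assert(n > 0);
    return new PhotonRecord[n]();
}

// Counts elements whose members are all still zero, e.g. slots that a
// packing loop never wrote because its count disagreed with n.
std::size_t count_unfilled(const PhotonRecord* buf, std::size_t n) {
    std::size_t unfilled = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (buf[i].energy == 0.0 && buf[i].to_grid == 0 && buf[i].type == 0)
            ++unfilled;
    return unfilled;
}
```

If the Valgrind warnings disappear once the allocation is value-initialized, that would suggest some entries of the send list are read without ever being filled, which narrows the search to the code that packs the list.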
It's very likely I have something screwy in my fork that is the real source of the issue and entirely unrelated to the above, but I was hoping someone could help me definitively rule out the above two bits of code as the problem points.
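On point 1, note that `delete []` on a null pointer is already a no-op, so a NULL check cannot change the outcome. The glibc messages are raised when the allocator finds its bookkeeping damaged at free time, which typically happens after an earlier out-of-bounds write. One cheap way to test whether a temporary buffer is being overrun, even in the optimized full-size run, is to pad it with sentinel values and check them before the delete. This is only a sketch under assumptions: `GuardedBuffer` and `deposit_unchecked` are hypothetical and not part of Enzo.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of Enzo: a float buffer with one sentinel
// element on each side. Sentinels are zero because the deposit below
// accumulates with +=, so any nonzero stray contribution changes them.
struct GuardedBuffer {
    std::size_t size;
    float*      raw;   // layout: [sentinel][size payload floats][sentinel]

    explicit GuardedBuffer(std::size_t n)
        : size(n), raw(new float[n + 2]()) {}   // "()" zeroes all elements
    ~GuardedBuffer() { delete [] raw; }
    GuardedBuffer(const GuardedBuffer&) = delete;
    GuardedBuffer& operator=(const GuardedBuffer&) = delete;

    float* data() { return raw + 1; }

    // True while neither sentinel has been modified.
    bool intact() const {
        return raw[0] == 0.0f && raw[size + 1] == 0.0f;
    }
};

// Deliberately faulty deposit used only for the demonstration: "index" is
// not range-checked, so index == buf.size lands on the trailing sentinel.
void deposit_unchecked(GuardedBuffer& buf, std::size_t index, float mass) {
    buf.data()[index] += mass;
}
```

Calling `intact()` right before the buffer is released, and aborting with the grid and processor number if it returns false, would show whether the deposit itself writes one element past either end. It only catches overruns that land on the sentinel elements, so a clean result does not rule out a larger stray write from elsewhere.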
And I'd take any suggestions anyone may have. I've very nearly exhausted all of my normal strategies for tracking something like this down. Unfortunately, I'm very limited in how far back in my code history I can go and re-run to track down the problem commit, since the run requires fairly recent additions to work. I've already demonstrated that the issue is still present as far back in the history as I can go.
Best,
Andrew
---
Pasadena Fellow in Theoretical Astrophysics
Carnegie Observatories
California Institute of Technology