Understanding segmentation fault


Carlo Giesa

May 24, 2024, 10:29:27 AM
to gaffer-dev
Hi there!

I'm currently doing tests to switch our pipeline to Rocky 9 and unfortunately I'm hitting a wall with Gaffer.

I still need to figure out some details on my side but maybe someone could already give me some hints about how to proceed on this one.

We have a Gaffer template for model QC, and I'm able to execute it from the command line on my workstation, which runs Rocky 9. But when I execute the exact same command line on a render node, it produces a segmentation fault. The problem is that nothing indicates where this segmentation fault comes from. The exact same configuration runs on the farm when I use a render node that has Alma Linux 8 installed.

I increased Arnold verbosity to the maximum and also set IECORE_LOG_LEVEL=DEBUG, but I get no more information about what went wrong.

I'm really not a C++ debugging specialist, maybe there are some tricks to get a bit more information about this?

Thanks already for any input.

Greets,
Carlo

Carlo Giesa

May 24, 2024, 10:35:34 AM
to gaffer-dev
Ok, I found some 'how to debug with Gaffer' information in this group!

I was able to get the following stack trace:

#0  0x00007ffff7f7bf6e in malloc (size=576) at src/jemalloc.c:926
#1  0x00007ffff7fddab1 in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
#2  _dl_resize_dtv (dtv=dtv@entry=0x7ffff7b08a40, max_modid=max_modid@entry=20) at ../elf/dl-tls.c:499
#3  0x00007ffff7fde3f0 in _dl_update_slotinfo (req_modid=1, new_gen=13) at ../elf/dl-tls.c:815
#4  0x00007ffff7fde4ec in update_get_addr (ti=0x7ffff7fbcfb0, gen=<optimized out>) at ../elf/dl-tls.c:922
#5  0x00007ffff7fc9eac in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#6  0x00007ffff7f7c173 in je_tcache_get (create=true) at include/jemalloc/internal/tcache.h:143
#7  je_arena_malloc (try_tcache=true, zero=false, size=576, arena=0x0) at include/jemalloc/internal/arena.h:956
#8  je_imalloct (arena=0x0, try_tcache=true, size=576) at include/jemalloc/internal/jemalloc_internal.h:771
#9  je_imalloc (size=576) at include/jemalloc/internal/jemalloc_internal.h:780
#10 malloc (size=576) at src/jemalloc.c:929
#11 0x00007ffff7fddab1 in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
#12 _dl_resize_dtv (dtv=dtv@entry=0x7ffff7b08a40, max_modid=max_modid@entry=20) at ../elf/dl-tls.c:499
#13 0x00007ffff7fde3f0 in _dl_update_slotinfo (req_modid=1, new_gen=13) at ../elf/dl-tls.c:815
#14 0x00007ffff7fde4ec in update_get_addr (ti=0x7ffff7fbcfb0, gen=<optimized out>) at ../elf/dl-tls.c:922
#15 0x00007ffff7fc9eac in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#16 0x00007ffff7f7c173 in je_tcache_get (create=true) at include/jemalloc/internal/tcache.h:143
#17 je_arena_malloc (try_tcache=true, zero=false, size=576, arena=0x0) at include/jemalloc/internal/arena.h:956
#18 je_imalloct (arena=0x0, try_tcache=true, size=576) at include/jemalloc/internal/jemalloc_internal.h:771
#19 je_imalloc (size=576) at include/jemalloc/internal/jemalloc_internal.h:780
#20 malloc (size=576) at src/jemalloc.c:929
#21 0x00007ffff7fddab1 in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
#22 _dl_resize_dtv (dtv=dtv@entry=0x7ffff7b08a40, max_modid=max_modid@entry=20) at ../elf/dl-tls.c:499
#23 0x00007ffff7fde3f0 in _dl_update_slotinfo (req_modid=1, new_gen=13) at ../elf/dl-tls.c:815
#24 0x00007ffff7fde4ec in update_get_addr (ti=0x7ffff7fbcfb0, gen=<optimized out>) at ../elf/dl-tls.c:922
#25 0x00007ffff7fc9eac in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#26 0x00007ffff7f7c173 in je_tcache_get (create=true) at include/jemalloc/internal/tcache.h:143
#27 je_arena_malloc (try_tcache=true, zero=false, size=576, arena=0x0) at include/jemalloc/internal/arena.h:956
#28 je_imalloct (arena=0x0, try_tcache=true, size=576) at include/jemalloc/internal/jemalloc_internal.h:771
#29 je_imalloc (size=576) at include/jemalloc/internal/jemalloc_internal.h:780
#30 malloc (size=576) at src/jemalloc.c:929
#31 0x00007ffff7fddab1 in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
#32 _dl_resize_dtv (dtv=dtv@entry=0x7ffff7b08a40, max_modid=max_modid@entry=20) at ../elf/dl-tls.c:499
#33 0x00007ffff7fde3f0 in _dl_update_slotinfo (req_modid=1, new_gen=13) at ../elf/dl-tls.c:815
#34 0x00007ffff7fde4ec in update_get_addr (ti=0x7ffff7fbcfb0, gen=<optimized out>) at ../elf/dl-tls.c:922
#35 0x00007ffff7fc9eac in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#36 0x00007ffff7f7c173 in je_tcache_get (create=true) at include/jemalloc/internal/tcache.h:143
#37 je_arena_malloc (try_tcache=true, zero=false, size=576, arena=0x0) at include/jemalloc/internal/arena.h:956
#38 je_imalloct (arena=0x0, try_tcache=true, size=576) at include/jemalloc/internal/jemalloc_internal.h:771
#39 je_imalloc (size=576) at include/jemalloc/internal/jemalloc_internal.h:780
#40 malloc (size=576) at src/jemalloc.c:929
#41 0x00007ffff7fddab1 in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
#42 _dl_resize_dtv (dtv=dtv@entry=0x7ffff7b08a40, max_modid=max_modid@entry=20) at ../elf/dl-tls.c:499
#43 0x00007ffff7fde3f0 in _dl_update_slotinfo (req_modid=1, new_gen=13) at ../elf/dl-tls.c:815
#44 0x00007ffff7fde4ec in update_get_addr (ti=0x7ffff7fbcfb0, gen=<optimized out>) at ../elf/dl-tls.c:922
#45 0x00007ffff7fc9eac in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#46 0x00007ffff7f7c173 in je_tcache_get (create=true) at include/jemalloc/internal/tcache.h:143
#47 je_arena_malloc (try_tcache=true, zero=false, size=576, arena=0x0) at include/jemalloc/internal/arena.h:956
#48 je_imalloct (arena=0x0, try_tcache=true, size=576) at include/jemalloc/internal/jemalloc_internal.h:771
#49 je_imalloc (size=576) at include/jemalloc/internal/jemalloc_internal.h:780

Not sure if that is of any help.

Greets,
Carlo

Carlo Giesa

May 24, 2024, 10:56:33 AM
to gaffer-dev
I tried with Gaffer 1.3.11.0 and 1.3.16.3. Didn't check with the gcc11 versions yet. I just wanted to change as little as possible to avoid having too many possible sources of errors.

Carlo Giesa

May 24, 2024, 12:27:01 PM
to gaffer-dev
And I just realized that this looks like an endless loop. I only copied the first 50 entries, but this goes on and on and on. I'm still printing the entire stack trace and I am at about line 300,000.
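
For anyone facing a similarly long trace, here is a minimal sketch of how the repetition could be spotted without reading hundreds of thousands of lines. It assumes the gdb backtrace was saved to a text file in the format shown above; `frame_keys` and `find_cycle` are hypothetical helper names, not part of gdb or Gaffer. Frames are compared by function name and source file only, because the line number can differ between otherwise identical frames (frame #0 stops at jemalloc.c:926 while frame #10 is at jemalloc.c:929).

```python
import re

# Matches gdb backtrace lines such as
#   "#0  0x00007ffff7f7bf6e in malloc (size=576) at src/jemalloc.c:926"
#   "#7  je_arena_malloc (try_tcache=true, ...) at include/.../arena.h:956"
# Group 1 is the function name, group 2 the source file (line number dropped).
FRAME_RE = re.compile(
    r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(.*\) at (\S+?):\d+\s*$"
)


def frame_keys(trace_text):
    """Return one (function, source_file) tuple per recognisable frame line."""
    keys = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            keys.append(match.groups())
    return keys


def find_cycle(keys, max_period=64):
    """Return the shortest period p for which the whole list repeats every
    p frames (at least two full repetitions), or None if there is none."""
    longest = min(max_period, len(keys) // 2)
    for period in range(1, longest + 1):
        if all(keys[i] == keys[i + period] for i in range(len(keys) - period)):
            return period
    return None
```

Applied to the 50 frames quoted earlier in this thread, `find_cycle(frame_keys(text))` should report a period of 10: `malloc` in jemalloc calls into the dynamic loader's TLS code, which calls `malloc` again. Note that this sketch only checks frames from the top of the trace, so it would need adapting if non-repeating frames come first.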

Carlo Giesa

May 24, 2024, 12:30:45 PM
to gaffer-dev
So, finally, it came to an end. I've attached the beginning of the trace, up to the point where the endless loop starts.
stack_trace_gaffer_render_crash.txt

Murray Stevenson

May 24, 2024, 1:33:52 PM
to gaffer-dev
Hi Carlo,

Are you running different versions of Rocky 9 on your workstations and render nodes? If the render nodes are on Rocky 9.4 then this looks to be the same issue that Robert mentions over here https://groups.google.com/g/gaffer-dev/c/EpFcQsnPQEU/m/0U_B7kRkDAAJ, which is related to glibc updates in RHEL 9.4 causing a crash in the Jemalloc memory allocator used by Gaffer.

We've patched Jemalloc to run on 9.4 for Gaffer 1.4.4.0, and you should be able to run older versions of Gaffer by disabling Jemalloc with the `GAFFER_JEMALLOC=0` environment variable.
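
As an illustration of the workaround, here is a minimal sketch of a launcher that sets `GAFFER_JEMALLOC=0` for a single command instead of for the whole shell session. The helper names `env_without_jemalloc` and `run_gaffer` are hypothetical, and the command line passed in is whatever you already use on the farm; only the environment variable itself comes from this thread.

```python
import os
import subprocess


def env_without_jemalloc(base_env=None):
    """Return a copy of the environment with GAFFER_JEMALLOC set to "0".

    The input mapping is not modified; os.environ is used when none is given.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GAFFER_JEMALLOC"] = "0"
    return env


def run_gaffer(args, base_env=None):
    """Run the given command line with jemalloc disabled.

    `args` is the argument list of your existing Gaffer command (an
    assumption here, since the exact command is not shown in the thread).
    Returns the exit code of the child process.
    """
    completed = subprocess.run(args, env=env_without_jemalloc(base_env))
    return completed.returncode
```

A farm wrapper could then call `run_gaffer([...])` with its usual arguments and treat a non-zero return code as a failed task. Setting the variable per process keeps the change local to Gaffer jobs.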

Cheers,

Murray

Robert Kolbeins

May 24, 2024, 1:39:27 PM
to gaffe...@googlegroups.com
Have you tried `export GAFFER_JEMALLOC=0`?

Murray Stevenson

May 24, 2024, 1:41:55 PM
to gaffer-dev
Robert's reply was actually sent a few hours ago but was caught up in Google's spam filter... Thanks for jumping in! :)

M

Carlo Giesa

May 25, 2024, 4:38:28 AM
to gaffe...@googlegroups.com
Hi there!

Oh yeah, of course, I should have thought of that before. But I thought that this was only related to the Gaffer 1.4.x.x releases, which I have not tested yet (I'm trying to limit the number of differences while doing tests on Rocky 9). I will try that out on Monday morning and keep you posted.

And to be honest, I'm not entirely sure exactly which versions of Rocky we are using on workstations and render nodes. I remember that we had issues with zombie machines (running but not doing anything) on the farm with a more recent version of Rocky 9. I can double check on Monday with our IT guy.
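
For comparing machines, a minimal sketch that reports the distribution ID and version from `/etc/os-release` might look like this. It assumes the standard `KEY=value` layout of that file and uses a simplified parser that strips surrounding quotes but does not handle escape sequences; `parse_os_release` and `describe_host` are hypothetical helper names.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict.

    Simplified on purpose: blank lines and comments are skipped, surrounding
    quotes are stripped, and escape sequences are left untouched.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def describe_host(path="/etc/os-release"):
    """Return a short "<id> <version>" string for the current machine."""
    with open(path, encoding="utf-8") as handle:
        fields = parse_os_release(handle.read())
    return "{} {}".format(
        fields.get("ID", "unknown"), fields.get("VERSION_ID", "unknown")
    )
```

Running `describe_host()` on a workstation and on a render node would show whether they are on the same minor release.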

Thanks a lot for your input! I'll keep you posted.

Greets,
Carlo

Carlo Giesa

May 27, 2024, 10:43:12 AM
to gaffe...@googlegroups.com
Alright, just for your info. Setting 'GAFFER_JEMALLOC=0' did fix the issue. Thanks for your input. I'll be able to work around this until we are able to update to the latest version of Gaffer. Just out of curiosity, does this have any negative impact on the execution of Gaffer? Will this result in slower execution or does this have a bigger impact on memory usage?

Greets,
Carlo

John Haddon

May 28, 2024, 6:30:46 AM
to gaffe...@googlegroups.com
In our (somewhat limited) testing, Jemalloc does give a bit of a performance boost to Gaffer, but I'm not sure it consumes less memory. You can see the latest results for one workload here: https://github.com/GafferHQ/dependencies/pull/263#issuecomment-2109771401.
Cheers...
John

Carlo Giesa

May 28, 2024, 9:22:38 AM
to gaffe...@googlegroups.com
Thanks John for the details. I think that we can live with this until we switch to the latest Gaffer version.

Greets,
Carlo
