Re: Abort after 4 minutes run on a Raspberry Pi

Xerxes Rånby

Aug 29, 2012, 2:35:02 AM
to av...@googlegroups.com
backtrace from a debug build of avian:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4c2d9470 (LWP 3480)]
0x40222b04 in vm::setObjectClass (o=0x507294, value=0x4d6ef334)
    at src/machine.h:1881
1881            (cast<object>(o, 0)) & (~PointerMask)));
(gdb) bt
#0  0x40222b04 in vm::setObjectClass (o=0x507294, value=0x4d6ef334)
    at src/machine.h:1881
#1  0x40217b68 in vm::initMonitorNode (t=0x359e3c, o=0x507294, value=0x359e3c,
    next=0x0) at build/linux-arm-debug-openjdk/type-constructors.cpp:469
#2  0x4021d640 in vm::makeMonitorNode (t=0x359e3c, value=0x359e3c, next=0x0)
    at build/linux-arm-debug-openjdk/type-constructors.cpp:1624
#3  0x4027e76c in vm::monitorWait (t=0x359e3c, monitor=0x4d5c574c, time=10)
    at src/machine.h:3059
#4  0x4027ec60 in vm::wait (t=0x359e3c, o=0x4d4a2780, milliseconds=10)
    at src/machine.h:3208
#5  0x402756f0 in (anonymous namespace)::local::jvmWait (t=0x359e3c,
    arguments=0x4c2d8834) at src/classpath-openjdk.cpp:2826
#6  0x402b2bd8 in vmRun ()
   from /usr/lib/jvm/java-7-openjdk-armhf/jre/lib/arm/avian-dbg/libjvm.so
#7  0x4022dc28 in vm::runRaw (t=0x359e3c,
    function=0x402756a0 <(anonymous namespace)::local::jvmWait(vm::Thread*, uintptr_t*)>, arguments=0x4c2d8834) at src/machine.h:1920
#8  0x4022dc90 in vm::run (t=0x359e3c,
    function=0x402756a0 <(anonymous namespace)::local::jvmWait(vm::Thread*, uintptr_t*)>, arguments=0x4c2d8834) at src/machine.h:1927
#9  0x40275760 in (anonymous namespace)::local::JVM_MonitorWait (t=0x359e3c,
    o=0x4c2d8ab0, milliseconds=10) at src/classpath-openjdk.cpp:2838
#10 0x402b2b84 in vmNativeCall ()
   from /usr/lib/jvm/java-7-openjdk-armhf/jre/lib/arm/avian-dbg/libjvm.so
#11 0x401fa14c in vm::dynamicCall (function=0x40275708, arguments=0x4c2d8978,
    argumentTypes=0x4c2d8970 "\a\a\004M<\236\065", argumentCount=3,
    argumentsSize=16, returnType=0) at src/arm.h:252
#12 0x401f8fc8 in (anonymous namespace)::MySystem::call (this=0x12b98,
    function=0x40275708, arguments=0x4c2d8978,
    types=0x4c2d8970 "\a\a\004M<\236\065", count=3, size=16, returnType=0)
    at src/posix.cpp:766
#13 0x402556b4 in (anonymous namespace)::local::invokeNativeSlow (t=0x359e3c,
    method=0x4d6f23d8, function=0x40275708) at src/compile.cpp:7536
#14 0x402559c0 in (anonymous namespace)::local::invokeNative2 (t=0x359e3c,
    method=0x4d6f23d8) at src/compile.cpp:7608
#15 0x40255b54 in (anonymous namespace)::local::invokeNative (t=0x359e3c)
    at src/compile.cpp:7640
#16 0x44c1d080 in ?? ()
#17 0x44c1d080 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) list
1876    {
1877      cast<object>(o, 0)
1878        = reinterpret_cast<object>
1879        (reinterpret_cast<intptr_alias_t>(value)
1880         | (reinterpret_cast<intptr_alias_t>
1881            (cast<object>(o, 0)) & (~PointerMask)));
1882    }
1883   
1884    inline const char*
1885    findProperty(Machine* m, const char* name)

Joel Dice

Aug 29, 2012, 7:50:12 PM
to av...@googlegroups.com, xer...@gudinna.com
On Tue, 28 Aug 2012, xer...@gudinna.com wrote:

> I have been testing Avian for running OpenGL ES 2 code using the JogAmp
> JOGL bindings on a Raspberry Pi.
>
> # build avian against openjdk 7
> # I used the following build instructions to set up and use Avian on a
> Raspberry Pi: http://labb.zafena.se/?p=630
>
> # testcase setup:
> wget http://jogamp.org/deployment/archive/master/gluegen_584-joal_353-jogl_798-jocl_668-signed/archive/jogamp-all-platforms.7z
> 7z x jogamp-all-platforms.7z
> cd jogamp-all-platforms
> # Download and compile an OpenGL ES 2 vertex and fragment shader introduction
> wget https://raw.github.com/xranby/jogl-demos/master/src/demos/es2/RawGL2ES2demo.java
> javac -cp jar/jogl-all.jar:jar/gluegen-rt.jar RawGL2ES2demo.java
>
> ### Testcase: crashes and aborts after running for 4 minutes; this test is
> reproducible
> # When testing on a Raspberry Pi, pass
> -Dnativewindow.ws.name=jogamp.newt.driver.bcm.vc.iv
> # in order to successfully use the Pi Broadcom OpenGL ES 2 driver.
> java -avian -Djava.awt.headless=true \
> -Dnativewindow.ws.name=jogamp.newt.driver.bcm.vc.iv \
> -cp jar/jogl-all.jar:jar/gluegen-rt.jar:. RawGL2ES2demo

Hi Xerxes,

Thanks for the thorough bug report. I tried to reproduce it on my x86_64
Linux machine, but it never crashed. Then I tried it on my Raspbian QEMU
emulator, and it crashed, but GDB lost its handle on the process so I
couldn't get a backtrace, just this:

(gdb) r
Starting program: /usr/lib/jvm/java-7-openjdk-armhf/bin/java -avian
-Djava.awt.headless=true
-Dnativewindow.ws.name=jogamp.newt.driver.bcm.vc.iv -cp
jar/jogl-all.jar:jar/gluegen-rt.jar:. RawGL2ES2demo
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/arm-linux-gnueabihf/libthread_db.so.1".
[New Thread 0x47243470 (LWP 1979)]
[New Thread 0x4806c470 (LWP 1980)]
[Thread 0x4806c470 (LWP 1980) exited]
[New Thread 0x48f8f470 (LWP 1990)]
[New Thread 0x4978f470 (LWP 2000)]
[Thread 0x4978f470 (LWP 2000) exited]
[New Thread 0x4806c470 (LWP 2001)]
[Switching to Thread 0x4806c470 (LWP 2001)]
0x4003712c in start_thread () from
/lib/arm-linux-gnueabihf/libpthread.so.0
ptrace: No such process.
(gdb)

I don't have a real Raspberry Pi board, and I'm guessing the Broadcom
OpenGL hardware is emulated in QEMU anyway, so it's probably not a valid
test. If you can give me remote access to some real hardware, I might be
able to make some progress, but I can't tell from the traces you sent
what's wrong or what to try next. The second trace seems to indicate that
an object reference became invalid immediately after the object it
referred to was allocated, but I don't know how that would happen.

Meanwhile, one thing that might be useful is if you could use "thread
apply all bt" to get traces for all the threads at the time of the crash
using the debug build.

Joel Dice

Sep 3, 2012, 9:42:12 AM
to av...@googlegroups.com, xerxes...@gmail.com
On Mon, 3 Sep 2012, xerxes...@gmail.com wrote:

> This is clearly a bug in the GC heap handler.
> s->remaining() is low, usually only 3 bytes left in nextGen1, thus we have
> run out of nextGen1 heap while performing the copy during a GC run.

The size of nextGen1 is determined by the minimumNextGen1Capacity
function, which calculates the worst-case capacity needed (i.e. when all
objects are reachable and none can be discarded) based on the current size
of gen1, minus the set of survivors to be moved to gen2, plus the size of
all the objects in thread-local space (calculated in the footprint
function in machine.cpp), plus some padding to account for objects which
will need extra space reserved for their identity hash codes.

Based on the crash you're seeing, the VM must be underestimating the size
needed for the worst case.
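
To make the arithmetic concrete, here is a minimal sketch of the worst-case estimate described above. It is not Avian's code: the struct, the field names and both helper functions are hypothetical, and the sizes are in arbitrary units. It only illustrates how the terms combine and what the failing check in copyTo amounts to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inputs; these names are illustrative and are not Avian's. */
struct gc_sizes {
  size_t gen1;          /* current size of gen1 */
  size_t tenured;       /* survivors about to be moved to gen2 */
  size_t thread_local;  /* objects still in thread-local space */
  size_t hash_padding;  /* extra room reserved for identity hash codes */
};

/* Worst case: every object is reachable, so nothing can be discarded. */
static size_t worst_case_next_gen1(const struct gc_sizes *s)
{
  assert(s->tenured <= s->gen1);
  return s->gen1 - s->tenured + s->thread_local + s->hash_padding;
}

/* Same shape as the failing assertion: remaining space must cover size. */
static int fits(size_t capacity, size_t used, size_t size)
{
  return used <= capacity && capacity - used >= size;
}
```

With a capacity of 95 units and 92 already used, an object of size 18 does not fit, which is the situation the assertion reports. If any one of the four terms is underestimated, the copy can run out of room in exactly this way.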

> Are there any good debug flags to trace memory leaks that will eventually
> fill up the heap inside avian?

You could try building using mode=stress, which will try to GC on every
allocation. It will be very slow, though, and I'm not sure it will help
reproduce the problem. I can't think of anything else besides adding some
printfs to try to figure out where the calculated and actual worst-case
sizes diverge.

I'll try again to reproduce it on a 32-bit system and see what happens.

>
> On Monday, September 3, 2012 10:45:29 AM UTC+2, xerxes...@gmail.com wrote:
>
> (gdb) frame 8
> #8  0x00358247 in copyTo (c=0x804c004, s=0x804c068, o=0xb096fef8, size=18)
>     at src/heap.cpp:993
> 993      assert(c, s->remaining() >= size);
>
> #6  0x00324bc9 in vm::assert (s=0x804bc50, v=false) at src/system.h:195
> #7  0x0035789b in assert (c=0x804c004, v=false) at src/heap.cpp:758
> #8  0x00358247 in copyTo (c=0x804c004, s=0x804c068, o=0xb096fef8, size=18)
>     at src/heap.cpp:993
> #9  0x0035853c in copy2 (c=0x804c004, o=0xb096fef8) at src/heap.cpp:1043

Joel Dice

Sep 5, 2012, 3:08:20 PM
to av...@googlegroups.com
On Mon, 3 Sep 2012, xerxes...@gmail.com wrote:

> The avian code base became much more stable after the 30 Aug 2012 fixes; the
> crash frequency is now down to one crash in 15 min.
> I can reproduce a similar crash; this time it hits an assert on both arm and
> i386 using the same JogAmp testcase.
> I think the root cause is a more generic, possibly 32-bit, bug:
>
> (gdb) frame 8
> #8  0x00358247 in copyTo (c=0x804c004, s=0x804c068, o=0xb096fef8, size=18)
>     at src/heap.cpp:993
> 993      assert(c, s->remaining() >= size);
> (gdb) frame 7
> #7  0x0035789b in assert (c=0x804c004, v=false) at src/heap.cpp:758
> 758      assert(c->system, v);
> (gdb)
> (gdb) frame 6
> #6  0x00324bc9 in vm::assert (s=0x804bc50, v=false) at src/system.h:195
> 195      expect(s, v);
> (gdb) frame 5
> #5  0x00324ba7 in vm::expect (s=0x804bc50, v=false) at src/system.h:182
> 182      if (UNLIKELY(not v)) abort(s);

I haven't been able to reproduce this exact problem, but I ran into the
original SIGSEGV a few times. I had a hunch that it might have something
to do with

https://github.com/ReadyTalk/avian/commit/1f1c3c4c414a62643f279f1527c43ff06788d016

so I tried the attached patch, and now it's not crashing anymore. I'm not
satisfied with that, though, because I don't see a clear relationship
between that patch and the crashes. Would you mind applying it and
letting me know if you still see problems?

Thanks.
detach.patch

Joel Dice

Sep 6, 2012, 9:05:45 PM
to av...@googlegroups.com
On Thu, 6 Sep 2012, xerxes...@gmail.com wrote:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7bccb40 (LWP 19533)]
> 0x00271e58 in ?? () from /lib/i386-linux-gnu/libc.so.6
> (gdb) bt
> #0  0x00271e58 in ?? () from /lib/i386-linux-gnu/libc.so.6
> #1  0x06728e27 in ?? () from /usr/lib/i386-linux-gnu/dri/i965_dri.so
> #2  0x0673c42f in brw_upload_state ()
>    from /usr/lib/i386-linux-gnu/dri/i965_dri.so
> #3  0x067276f7 in brw_draw_prims ()
>    from /usr/lib/i386-linux-gnu/dri/i965_dri.so
> #4  0x068dd69e in ?? () from /usr/lib/i386-linux-gnu/dri/libdricore.so
> #5  0x04861f9d in
> Java_jogamp_opengl_gl4_GL4bcImpl_dispatch_1glDrawArrays1(int0_t, __complex)
> ()
>    from /tmp/jogamp_0000/file_cache/jln507172062703352522/jln4809309549775336484/libjogl_desktop.so
> #6  0x003c0ed3 in vmNativeCall ()
>    from /usr/lib/jvm/java-7-openjdk-i386/jre/lib/i386/avian-dbg/libjvm.so
> #7  0x0032493c in vm::dynamicCall (function=0x4861f6c, arguments=0x7bcbb70,
>     argumentsSize=28, returnType=0) at src/x86.h:97
> #8  0x00323fd6 in call (this=0x804bc50, function=0x4861f6c,
>     arguments=0x7bcbb70, types=0x7bcbb50 "\a\a\003\003\003\004\064",
>     count=6, size=28, returnType=0) at src/posix.cpp:766
> #9  0x0037450a in invokeNativeSlow (t=0x822f5f4, method=0xae709ae8,
>     function=0x4861f6c) at src/compile.cpp:7536
> #10 0x0037482a in invokeNative2 (t=0x822f5f4, method=0xae709ae8)
>     at src/compile.cpp:7608
> #11 0x0037498c in invokeNative (t=0x822f5f4) at src/compile.cpp:7640
> #12 0x00425051 in ?? ()

I'm seeing something similar. Here's what I think the problem is:

GL2ES2.glVertexAttribPointer (which is used in RawGL2ES2demo to draw a
triangle) ultimately results in a call to
GL4bcImpl.dispatch_glVertexAttribPointer1, which is a native method
implemented as
Java_jogamp_opengl_gl4_GL4bcImpl_dispatch_1glVertexAttribPointer1__IIIZILjava_lang_Object_2IZJ
in GL4bcImpl_JNI.c, which is generated as part of the JOGL build. Here's
the generated code:

JNIEXPORT void JNICALL
Java_jogamp_opengl_gl4_GL4bcImpl_dispatch_1glVertexAttribPointer1__IIIZILjava_lang_Object_2IZJ(
    JNIEnv *env, jobject _unused, jint index, jint size, jint type,
    jboolean normalized, jint stride, jobject pointer,
    jint pointer_byte_offset, jboolean pointer_is_nio, jlong procAddress) {
  typedef void (APIENTRY*_local_PFNGLVERTEXATTRIBPOINTERPROC)(
      GLuint index, GLint size, GLenum type, GLboolean normalized,
      GLsizei stride, const GLvoid * pointer);
  _local_PFNGLVERTEXATTRIBPOINTERPROC ptr_glVertexAttribPointer;
  GLvoid * _pointer_ptr = NULL;
  if ( NULL != pointer ) {
    _pointer_ptr = (GLvoid *) (((char*) ( JNI_TRUE == pointer_is_nio ?
        (*env)->GetDirectBufferAddress(env, pointer) :
        (*env)->GetPrimitiveArrayCritical(env, pointer, NULL) ) ) +
        pointer_byte_offset);
  }
  ptr_glVertexAttribPointer =
      (_local_PFNGLVERTEXATTRIBPOINTERPROC) (intptr_t) procAddress;
  assert(ptr_glVertexAttribPointer != NULL);
  (* ptr_glVertexAttribPointer) ((GLuint) index, (GLint) size, (GLenum) type,
      (GLboolean) normalized, (GLsizei) stride, (GLvoid *) _pointer_ptr);
  if ( JNI_FALSE == pointer_is_nio && NULL != pointer ) {
    (*env)->ReleasePrimitiveArrayCritical(env, pointer, _pointer_ptr,
        JNI_ABORT);
  }
}

Yes, it's an ugly mess, but if you look closely, you can see that if
pointer_is_nio is false (and GDB is telling me it is indeed false), it
uses GetPrimitiveArrayCritical to get a direct pointer to the buffer.
That's fine, but then it releases that pointer with a call to
ReleasePrimitiveArrayCritical right after calling glVertexAttribPointer.
That can be a problem later when glDrawArrays is called, because the
native OpenGL implementation may now have a stale pointer to an object
which has been moved to a different place in memory due to a garbage
collection cycle.
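
The hazard can be shown without JNI or OpenGL at all. The following is a toy model only: toy_heap, toy_address and toy_collect are made-up names, and the two-space layout is an assumption chosen for illustration rather than a description of Avian's collector. A raw address taken before a collection no longer refers to the live object afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_WORDS 8

/* Toy two-space heap holding a single float array (hypothetical model). */
struct toy_heap {
  float space[2][TOY_WORDS];
  int active;               /* which half currently holds the object */
};

/* A raw address into the heap, as a direct-pointer accessor would return. */
static float *toy_address(struct toy_heap *h)
{
  return h->space[h->active];
}

/* "Collect": copy the object to the other half and poison the old half. */
static void toy_collect(struct toy_heap *h)
{
  assert(h != NULL);
  int to = 1 - h->active;
  memcpy(h->space[to], h->space[h->active], sizeof h->space[to]);
  for (size_t i = 0; i < TOY_WORDS; ++i)
    h->space[h->active][i] = -1.0f;   /* old location is now garbage */
  h->active = to;
}
```

A caller that saves the result of toy_address, runs toy_collect, and then reads through the saved pointer sees the poison value instead of its data. In this model that corresponds to the driver reading the vertex array during glDrawArrays after the array has been moved.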

Valgrind also revealed a similar bug in
Java_jogamp_opengl_x11_glx_GLX_dispatch_1glXGetProcAddress1__Ljava_lang_String_2J,
which calls ReleaseStringUTFChars to release the temporary char array
passed to glXGetProcAddress, even though Mesa holds on to that pointer and
references it in subsequent calls to glXGetProcAddress (which may just be
a bug in Mesa rather than JOGL).

Anyway, from what I can see this is a JOGL bug.

Xerxes Rånby

Sep 7, 2012, 7:33:55 PM
to av...@googlegroups.com
On Friday, September 7, 2012 at 03:05:10 UTC+2, Joel Dice wrote:
> Yes, it's an ugly mess, but if you look closely, you can see that if
> pointer_is_nio is false (and GDB is telling me it is indeed false), it
> uses GetPrimitiveArrayCritical to get a direct pointer to the buffer.
> That's fine, but then it releases that pointer with a call to
> ReleasePrimitiveArrayCritical right after calling glVertexAttribPointer.
> That can be a problem later when glDrawArrays is called, because the
> native OpenGL implementation may now have a stale pointer to an object
> which has been moved to a different place in memory due to a garbage
> collection cycle.
>
> Valgrind also revealed a similar bug in
> Java_jogamp_opengl_x11_glx_GLX_dispatch_1glXGetProcAddress1__Ljava_lang_String_2J,
> which calls ReleaseStringUTFChars to release the temporary char array
> passed to glXGetProcAddress, even though Mesa holds on to that pointer and
> references it in subsequent calls to glXGetProcAddress (which may just be
> a bug in Mesa rather than JOGL).
>
> Anyway, from what I can see this is a JOGL bug.


Thank you for your great attention to detail!
We are now discussing this issue internally inside the JogAmp community.
According to the OpenGL spec, there is no requirement that the pointer in user
space still be valid after the calls, hence a bug in Mesa. But we also
acknowledge that Mesa may not change overnight, and there might be other
driver implementations that assume the pointers stay valid, for performance
reasons. In the end, it might well be that JogAmp JOGL will have to resort to
using NIO direct byte buffers, which should stay unmoved after GC runs.

Joel Dice

Sep 9, 2012, 2:05:06 PM
to av...@googlegroups.com
How will you ensure that those buffers remain reachable (i.e. ineligible
for collection) for as long as necessary to allow calls to e.g.
glDrawArrays to succeed? Based on my limited understanding of the OpenGL
API, I guess it would be necessary for JOGL to maintain a mapping of
indexes to buffers internally and retain a given mapping until it is
overwritten by application code. It might actually be more efficient to
handle this in the native code and allocate the array passed to OpenGL
using malloc, freeing it only when the index it belongs to is overwritten
(or when the whole stack is disposed of).

Anyway, I wanted to point out that there are actually two problems to
address: buffers can move during GC, and they can also be collected if
they become unreachable.
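
A rough sketch of the malloc-based idea, to show what "retain until overwritten" could look like on the native side. Everything here is hypothetical: attrib_store, its functions and the MAX_ATTRIBS limit are invented for illustration and are not part of JOGL or OpenGL. The caller is assumed to know the byte length of the array and to start from a zero-initialised store.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ATTRIBS 16   /* assumed limit for this sketch */

/* One malloc'ed copy per attribute index; zero-initialise before use. */
struct attrib_store {
  void *copy[MAX_ATTRIBS];
};

/* Copy the caller's data into stable memory for `index`, freeing whatever
   was stored there before. Returns the stable pointer, or NULL on failure. */
static void *attrib_store_set(struct attrib_store *s, unsigned index,
                              const void *src, size_t bytes)
{
  assert(s != NULL);
  if (index >= MAX_ATTRIBS || src == NULL || bytes == 0)
    return NULL;
  void *fresh = malloc(bytes);
  if (fresh == NULL)
    return NULL;
  memcpy(fresh, src, bytes);
  free(s->copy[index]);   /* the old copy is dead once the index is overwritten */
  s->copy[index] = fresh;
  return fresh;
}

/* Release every copy, e.g. when the whole context is disposed of. */
static void attrib_store_dispose(struct attrib_store *s)
{
  assert(s != NULL);
  for (unsigned i = 0; i < MAX_ATTRIBS; ++i) {
    free(s->copy[i]);
    s->copy[i] = NULL;
  }
}
```

The pointer returned by attrib_store_set is what would be handed to the driver in place of the address of the Java array. Because the copy lives outside the Java heap, it can neither move nor be collected, which covers both problems at the cost of one extra copy per call.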