[erlang-questions] How to debug dirty NIF?

163 views
Skip to first unread message

Max Lapshin

unread,
Aug 28, 2014, 11:19:40 AM8/28/14
to Erlang-Questions Questions
I'm running CPU bound task (make thumbnails from video) in a NIF under  erlang 17 (erts 6.0)

I'm using dirty nif scheduling:

static ERL_NIF_TERM
yuv2jpeg0(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
....
}


static ERL_NIF_TERM
async_jpeg(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
  ERL_NIF_TERM result = yuv2jpeg0(env, argc, argv);
  return enif_schedule_dirty_nif_finalizer(env, result, enif_dirty_nif_finalizer);
}


static ERL_NIF_TERM
yuv2jpeg(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
  return enif_schedule_dirty_nif(env, ERL_NIF_DIRTY_JOB_CPU_BOUND, async_jpeg, argc, argv);
}

I see strange situation: none of CPU core is not 100% loaded, but processes that are calling yuv2jpeg function are hanging in this function:


(flus...@127.0.0.1)3> process_info(pid(0,961,0)).                  

[{current_function,{avcodec,yuv2jpeg0,4}},

 {initial_call,{proc_lib,init_p,5}},

 {status,runnable},

 {message_queue_len,11},

 {messages,[{jpeg,[{video_frame,video,37885047.833333336,

                                37885047.833333336,0,h264,keyframe,1,

                                <<0,0,0,2,9,240,0,0,21,...>>,



I suppose that I could meet the old problem with wrong scheduler behaviour when NIF is using thread for too long.

yuv2jpeg takes usually about 2-4 milliseconds to run.

Is it possible to debug this situation? Can I somehow ask erlang if it decided that some scheduler is considered idle?

I launch erlang so:

erlang/erts-6.0/bin/beam.smp -K true -Bi -A 100 -a 32 -stbt s -MBas aoffcaobf -MBacul 0



Steve Vinoski

unread,
Aug 29, 2014, 11:55:14 AM8/29/14
to Max Lapshin, Erlang-Questions Questions
On Thu, Aug 28, 2014 at 11:19 AM, Max Lapshin <max.l...@gmail.com> wrote:
I'm running CPU bound task (make thumbnails from video) in a NIF under  erlang 17 (erts 6.0)

I'm using dirty nif scheduling:

static ERL_NIF_TERM
yuv2jpeg0(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
....
}


static ERL_NIF_TERM
async_jpeg(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
  ERL_NIF_TERM result = yuv2jpeg0(env, argc, argv);
  return enif_schedule_dirty_nif_finalizer(env, result, enif_dirty_nif_finalizer);
}


static ERL_NIF_TERM
yuv2jpeg(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
  return enif_schedule_dirty_nif(env, ERL_NIF_DIRTY_JOB_CPU_BOUND, async_jpeg, argc, argv);
}

I see strange situation: none of CPU core is not 100% loaded, but processes that are calling yuv2jpeg function are hanging in this function:


(flus...@127.0.0.1)3> process_info(pid(0,961,0)).                  

[{current_function,{avcodec,yuv2jpeg0,4}},


There's always the "printf debugger" :) -- have your entry NIF and your dirty NIF print their respective thread IDs to make sure they're on different threads, and have each print something just before returning. You might also consider writing your own finalizer so you can print from there too. With these in place, you'll at least know where something might be getting hung up.

A more involved alternative, though still not all that difficult, would be to build a debuggable runtime and run everything under gdb.

I suppose that I could meet the old problem with wrong scheduler behaviour when NIF is using thread for too long.


Scheduler collapse is possible with regular schedulers but not with dirty schedulers, as the whole point of having the latter is that they're not bound to the regular scheduler constraints. But if your regular schedulers are already collapsed, they'll never run and so will never switch your processes over to a dirty scheduler.

yuv2jpeg takes usually about 2-4 milliseconds to run.

Is it possible to debug this situation? Can I somehow ask erlang if it decided that some scheduler is considered idle?


You might get some useful info from erlang:system_info using arguments such as schedulers_state, scheduling_statistics, and thread_progress, and maybe also from erlang:statistics(run_queues). See http://www.erlang.org/doc/man/erlang.html for details on these.

If there's a way for you to easily package something that I could run for myself to try to duplicate the problem, just let me know and I'll take a look.

--steve

Max Lapshin

unread,
Aug 30, 2014, 1:50:43 AM8/30/14
to Steve Vinoski, Erlang-Questions Questions
I couldn't find any detailed information about the problem, but I've cured it somehow by adding:

erlang:bump_reductions(5000) after each NIF call.

Scott Lystig Fritchie

unread,
Sep 1, 2014, 5:45:30 AM9/1/14
to Max Lapshin, Erlang-Questions Questions, Steve Vinoski
Hi, Max. Regarding collapse of the regular/non-dirty Erlang schedulers,
we've found this daemon'ish thing quite useful. It can detect scheduler
collapse, though it must create a bunch of nonsense busy work Erlang
processes to measure it. And it can fix the problem temporarily
... until it happens again.

https://gist.github.com/slfritchie/d834ec3f8a4321b1c2c2

-Scott
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Max Lapshin

unread,
Sep 1, 2014, 9:57:55 AM9/1/14
to Scott Lystig Fritchie, Erlang-Questions Questions, Steve Vinoski
Thank you a lot!

I'm also thinking about rewriting code to drivers + async threads.
Reply all
Reply to author
Forward
0 new messages