[erlang-questions] Segfault with Erlang R22

Bekes, Andras G

Oct 18, 2019, 10:33:58 AM

to Erlang Questions

Hi All,

After upgrading to Erlang/OTP 22, my software crashes the Erlang VM with a segmentation fault.

It happens rarely, only after several days of test workload, so I cannot reproduce it on demand.

I have collected more than 10 core dumps so far and loaded them into gdb; every one of them died at one of these two crash points:

Program terminated with signal 11, Segmentation fault.

#0 process_main (x_reg_array=0x20002, f_reg_array=0x2ade29280590) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:4064

4064 if (is_not_tuple(r(0))) {

or

#0 process_main (x_reg_array=0x20002, f_reg_array=0x2) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:5252

5252 c_p->seq_trace_lastcnt = unsigned_val(SEQ_TRACE_TOKEN_SERIAL(c_p));

The software is not doing any tracing when the crash happens, nor does it have any NIFs.

What should be the next step of investigation?

Thanks

Andras G. Bekes, Vice President
Morgan Stanley | Institutional Securities Tech
Lechner Odon fasor 8 | Floor 07
Budapest, 1095
Phone: +36 1 882-0791
Andras...@morganstanley.com
http://mgstn.ly/budapest

NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent required and/or permitted under applicable law, to monitor electronic communications, including telephone calls with Morgan Stanley personnel. This message is subject to the Morgan Stanley General Disclaimers available at the following link: http://www.morganstanley.com/disclaimers. If you cannot access the links, please notify us by reply message and we will send the contents to you. By communicating with Morgan Stanley you acknowledge that you have read, understand and consent, (where applicable), to the foregoing and the Morgan Stanley General Disclaimers.

Mikael Pettersson

Oct 18, 2019, 2:03:01 PM

to Bekes, Andras G, Erlang Questions

On Fri, Oct 18, 2019 at 4:34 PM Bekes, Andras G
<Andras...@morganstanley.com> wrote:
>
> Hi All,
>
>
>
> After upgrading to Erlang R22, my software crashes the Erlang VM with Segmentation fault.
>
> It happens rarely, only after several days of test workload, so I can’t really reproduce.
>
>
>
> I made more than 10 core dumps so far, loaded them into gdb, and all of them died at these 2 crash points:
>
>
>
> Program terminated with signal 11, Segmentation fault.
>
> #0 process_main (x_reg_array=0x20002, f_reg_array=0x2ade29280590) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:4064
>
> 4064 if (is_not_tuple(r(0))) {
>
>
>
> or
>
>
>
> #0 process_main (x_reg_array=0x20002, f_reg_array=0x2) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:5252
>
> 5252 c_p->seq_trace_lastcnt = unsigned_val(SEQ_TRACE_TOKEN_SERIAL(c_p));
>
>
>
> The software is not doing any tracing when the crash happens, nor does it have any NIFs.

I think you should open a bug report in the Erlang bug tracker.

The second crash site above is in the remove_message() function, in a
block where ERL_MESSAGE_TOKEN(msgp)
is neither NIL nor am_undefined, but SEQ_TRACE_TOKEN(c_p) is invalid
(not the expected 5-tuple).

Maybe printing *c_p in gdb when that happens could shed some light.

/Mikael
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
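
Note on the diagnosis above: the crashing line reads an element of the process's seq_trace token without first confirming that the token really is a tuple. The sketch below is a toy Python model of that situation, not the ERTS source. The tag constants, the 8-byte word size, the dict standing in for the heap, and the names tuple_arity and checked_element are all assumptions made for illustration; the real definitions live in the ERTS headers and may differ.

```python
from typing import Dict, Optional

# Toy model of tagged machine words. These constants are assumptions for
# illustration only and should be checked against the real ERTS headers.
PRIMARY_MASK = 0x3         # low two bits carry the primary tag
TAG_HEADER = 0x0           # word is a header (e.g. a tuple's arity word)
TAG_BOXED = 0x2            # word is a pointer to a header word
HEADER_SUBTAG_MASK = 0x3C  # assumed subtag bits; zero means "arity value"
ARITY_SHIFT = 6            # assumed position of the arity field
WORD = 8                   # assumed word size in bytes


def tuple_arity(term: int, heap: Dict[int, int]) -> Optional[int]:
    """Return the arity if `term` points at a tuple header, else None."""
    if (term & PRIMARY_MASK) != TAG_BOXED:
        return None                      # not a pointer at all
    header = heap.get(term & ~PRIMARY_MASK)
    if header is None:
        return None                      # unmapped address: C would fault here
    if (header & PRIMARY_MASK) != TAG_HEADER:
        return None
    if (header & HEADER_SUBTAG_MASK) != 0:
        return None                      # some other kind of boxed object
    return header >> ARITY_SHIFT


def checked_element(term: int, index: int, heap: Dict[int, int],
                    expected_arity: int) -> int:
    """Read element `index` (1-based) only after validating the tuple."""
    if tuple_arity(term, heap) != expected_arity:
        raise ValueError(f"{term:#x} is not a {expected_arity}-tuple")
    if not 1 <= index <= expected_arity:
        raise IndexError(index)
    return heap[(term & ~PRIMARY_MASK) + WORD * index]


# A well-formed 5-tuple at 0x1000, referenced by the boxed term 0x1002.
heap = {0x1000: 5 << ARITY_SHIFT,
        0x1008: 11, 0x1010: 22, 0x1018: 33, 0x1020: 44, 0x1028: 55}
print(checked_element(0x1002, 3, heap, 5))
print(tuple_arity(0x20002, heap))  # boxed tag, but nothing mapped there
```

In this model the unchecked C read corresponds to indexing the heap directly; with a corrupted token that lookup hits unmapped memory, which would be consistent with the signal 11 reported above.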

Bekes, Andras G

Oct 24, 2019, 11:52:23 AM

to Mikael Pettersson, Erlang Questions

Hi Mikael,

I filed a bug report in the bug tracker: https://bugs.erlang.org/browse/ERL-1074

Unfortunately printing *c_p did not reveal anything:

(gdb) print c_p
$1 = <value optimized out>
(gdb) print *c_p
value has been optimized out

What should be the next step?
I can reliably produce 5-10 core dumps per week in my test system.


Mikael Pettersson

Oct 24, 2019, 1:10:07 PM

to Bekes, Andras G, Erlang Questions

On Thu, Oct 24, 2019 at 4:57 PM Bekes, Andras G
<Andras...@morganstanley.com> wrote:
>
> Hi Mikael,
>
> I filed a bug report in the bug tracker: https://bugs.erlang.org/browse/ERL-1074
>
> Unfortunately printing *c_p did not reveal anything:
>
> (gdb) print c_p
> $1 = <value optimized out>
> (gdb) print *c_p
> value has been optimized out
>
> What should be the next step?
> I can reliably produce 5-10 core dumps per week in my test system.

I'd try to get a backtrace (bt command in gdb) from the crashed thread,
then maybe print the c_p parameter via its raw value
(print *(Process*)0x.....) if gdb insists that the value is optimized out.

/Mikael

Bekes, Andras G

Nov 1, 2019, 2:22:35 PM

to Mikael Pettersson, Erlang Questions

Program terminated with signal 11, Segmentation fault.

#0 process_main (x_reg_array=0x20002, f_reg_array=0x2) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:5252

5252 c_p->seq_trace_lastcnt = unsigned_val(SEQ_TRACE_TOKEN_SERIAL(c_p));

Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64
(gdb) bt

#0 process_main (x_reg_array=0x20002, f_reg_array=0x2) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:5252

#1 0x00000000004641a4 in sched_thread_func (vesdp=0x2b8244840200) at beam/erl_process.c:8465
#2 0x000000000069262a in thr_wrapper (vtwd=<value optimized out>) at pthread/ethread.c:118
#3 0x00002b81f80f7dd5 in _L_unlock_48 () from /lib64/libpthread.so.0
#4 0x00002b81f80f5eb3 in __find_thread_by_id () from /lib64/libpthread.so.0
#5 0x0000000000000000 in ?? ()
(gdb)

I am not sure how to "print the c_p parameter via its raw value (print *(Process*)0x.....)".
Where should I take the 0x..... value from?

Eckard Brauer

Nov 2, 2019, 5:01:34 AM

to erlang-q...@erlang.org

It has been a few years, but IIRC it is either "print *c_p" or "print
*((Process*) c_p)". The problem is probably that the processor has
already left the stack frame where c_p is valid.

You can do "info stack" at this point and select the frame with "frame
<#>" to try it again. If you're a little familiar with assembly
language, you can even have a look at "disassemble <address>" or
"disassemble function" to get an idea of where values are at what point
in the instruction/processing flow - sometimes this helps too.

I'd investigate starting with frame 2 here, as all frames below are
already in libpthread.

Hope that helps a bit...

Eckard

On Fri, 1 Nov 2019 18:22:18 +0000,
"Bekes, Andras G" <Andras...@morganstanley.com> wrote:

> [...]

>
> I'd try to get a backtrace (bt command in gdb) from the crashed
> thread, then maybe print
> the c_p parameter via its raw value (print *(Process*)0x.....) if gdb
> insists that the value
> is optimized out.
>
> /Mikael
>

> [...]
> [...]
> [...]


--

We are not liable for the correct function of the viruses contained in
this email! :)

Bekes, Andras G

Nov 7, 2019, 11:59:29 AM

to Eckard Brauer, erlang-q...@erlang.org

I am not entirely sure of what we're doing, but here is the output:

(gdb) frame 0

#0 process_main (x_reg_array=0x20002, f_reg_array=0x2) at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:5252
5252 c_p->seq_trace_lastcnt = unsigned_val(SEQ_TRACE_TOKEN_SERIAL(c_p));

(gdb) print x_reg_array
$4 = (Eterm *) 0x20002
(gdb) print *x_reg_array
Cannot access memory at address 0x20002
(gdb) print f_reg_array
$5 = (FloatDef *) 0x2
(gdb) print *f_reg_array
Cannot access memory at address 0x2
(gdb) print c_p
$7 = <value optimized out>

(gdb) print *c_p
value has been optimized out

(gdb) frame 1

#1 0x00000000004641a4 in sched_thread_func (vesdp=0x2b8244840200) at beam/erl_process.c:8465

8465 process_main(esdp->x_reg_array, esdp->f_reg_array);
(gdb) print esdp
$8 = (ErtsSchedulerData *) 0x2b8244840200
(gdb) print *esdp
$9 = {x_reg_array = 0x2b823e940200, f_reg_array = 0x2b823e942240, timer_wheel = 0x2b82450f5c80,
next_tmo_ref = 0x2b8245136120, timer_service = 0x2b8245176680, tid = 47838514915072, erl_bits_state = {
byte_buf_ = 0x2b823ad81058 "", byte_buf_len_ = 1, erts_current_bin_ = 0x2b832e60c688 "\n\274\362T",
erts_bin_offset_ = 32, erts_writable_bin_ = 0}, match_pseudo_process = 0x2b823bec7c78, free_process = 0x0,
thr_progress_data = {id = 1, is_managed = 1, is_blocking = 0, is_temporary = 0, wakeup_request = {5707836, 5707869,
5707862, 5707859}, leader = 0, active = 1, confirmed = 5707879, leader_state = {next = 5707875,
current = 18446744073709551615, chk_next_ix = 2, umrefc_ix = {current = 0, waiting = -1}}}, ssi = 0x2b823be7e680,
current_process = 0x2b82431401d8, type = ERTS_SCHED_NORMAL, no = 1, dirty_no = 0, flxctr_slot_no = 1,
current_nif = 0x0, dirty_shadow_process = 0x0, current_port = 0x0, run_queue = 0x2b823be77ec0, virtual_reds = 0,
cpu_id = -1, aux_work_data = {sched_id = 1, esdp = 0x2b8244840200, ssi = 0x2b823be7e680, current_thr_prgr = 5707878,
latest_wakeup = 5707869, misc = {ix = 0, thr_prgr = 18446744073709551615}, dd = {thr_prgr = 5707869}, cncld_tmrs = {
thr_prgr = 5707146}, later_op = {thr_prgr = 5707880, size = 65384, first = 0x2b832e8d76c8,
last = 0x2b832e8d76c8}, async_ready = {need_thr_prgr = 0, thr_prgr = 18446744073709551615,
queue = 0x2b8245059880}, delayed_wakeup = {next = 18446744073709551615, sched2jix = 0x2b82443650c8, jix = -1,
job = 0x2b8244364f00}, yield = {alcu_blockscan = {current = 0x0, last = 0x0}, ets_all = {ongoing = 0x0,
hfrag = 0x0, tab = 0x0, queue = 0x0}}, debug = {wait_completed = {flags = 0, callback = 0, arg = 0x0}}},
atom_cache_map = {hdr_sz = -1, sz = 0, long_atoms = 0, cix = {0 <repeats 2048 times>}, cache = {{atom = 0,
iix = -1} <repeats 2048 times>}}, last_monotonic_time = 54631431230926, check_time_reds = 3137, thr_id = 1,
unique = 251, ref = 1016430404454740281, alloc_data = {deallctr = {0x0, 0x0, 0x0, 0x2b81f77a9200, 0x2b823ad37200,
0x2b823bdaa200, 0x2b823e8d5200, 0x2b8240f13200, 0x2b8242ff9200, 0x0, 0x0, 0x2b823fea0200, 0x2b8241f86200, 0x0},
pref_ix = {0, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1}, flist_ix = {0 <repeats 14 times>}, pre_alc_ix = 0}, io = {
out = 21255996115, in = 21476723837}, pending_signal = {sig = 0x0, to = 0}, reductions = 1006796378,
sched_wall_time = {u = {mod = {counter = 0}, need = 0}, enabled = 0, start = 0, working = {total = 0, start = 0}},
gc_info = {reclaimed = 775481964, garbage_cols = 680476}, nosuspend_port_task_handle = {counter = 0}, ets_tables = {
count = {counter = 0}, clist = 0x0}}
(gdb) print esdp->x_reg_array
$10 = (Eterm *) 0x2b823e940200
(gdb) print *esdp->x_reg_array
$11 = 2522015978211937347
(gdb) print esdp->f_reg_array
$12 = (FloatDef *) 0x2b823e942240
(gdb) print *esdp->f_reg_array
$13 = {fd = 0.002545, fb = "\323\023\226x@\331d?", fs = {5075, 30870, 55616, 16228}, fw = {2023101395, 1063573824},
fdw = 4568014792984761299}

Frame 2 is already in pthread/ethread.c.
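
On the earlier question of where the 0x..... value could come from: the scheduler data printed in frame 1 contains a current_process field, which is a plausible candidate for the process that process_main was running when it crashed (an assumption, not something gdb confirms here). As a rough sketch, the hypothetical Python helpers below pull such a field out of saved gdb output and build the matching cast command; find_pointer_field and cast_command are names invented for this example.

```python
import re
from typing import Optional


def find_pointer_field(gdb_output: str, field: str) -> Optional[int]:
    """Return the first `field = 0x...` value in gdb struct output, or None."""
    match = re.search(rf"\b{re.escape(field)} = (0x[0-9a-fA-F]+)", gdb_output)
    return int(match.group(1), 16) if match else None


def cast_command(address: int, type_name: str = "Process") -> str:
    """Build the gdb command that dereferences `address` as `type_name`."""
    return f"print *({type_name} *){address:#x}"


# Excerpt copied from the `print *esdp` output in this message.
dump = ("current_process = 0x2b82431401d8, type = ERTS_SCHED_NORMAL, "
        "no = 1, dirty_no = 0, flxctr_slot_no = 1,")

address = find_pointer_field(dump, "current_process")
if address is not None:
    print(cast_command(address))  # print *(Process *)0x2b82431401d8
```

Inside an interactive gdb session the same value is available directly as esdp->current_process in frame 1; the helper is only useful when working from saved transcripts.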

Jonas Falkevik

Nov 8, 2019, 4:37:38 AM

to Bekes, Andras G, erlang-q...@erlang.org

Have you tried loading the ERTS gdb scripts?

They should be found under "erts/etc/unix/etp-commands".

Then you can get a stack trace from the process, for example:

(gdb) source <path to etp-commands>

(gdb) set $p = (Process *)0x2b82431401d8

(gdb) etp-stacktrace $p

/Jonas
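
Since several core dumps arrive each week, the steps above could be run non-interactively over every core. The following Python sketch only builds the gdb command line and leaves running it to the reader. The file names beam.smp, core.1234, and the etp-commands path are placeholders, and the choice of frame 1 with esdp->current_process assumes each core has the same backtrace shape as the one shown in this thread.

```python
import shlex
from typing import List


def gdb_batch_command(beam: str, core: str, etp_commands: str) -> List[str]:
    """Build a non-interactive gdb invocation for one core dump.

    Assumes frame 1 is sched_thread_func with `esdp` in scope, as in the
    backtrace posted in this thread; other crashes may need another frame.
    """
    return [
        "gdb", "-batch",
        "-ex", f"source {etp_commands}",   # load the ERTS gdb helpers
        "-ex", "bt",
        "-ex", "frame 1",
        "-ex", "set $p = (Process *) esdp->current_process",
        "-ex", "etp-stacktrace $p",
        beam, core,
    ]


# Placeholder paths; substitute the real emulator binary and core files.
cmd = gdb_batch_command("beam.smp", "core.1234", "erts/etc/unix/etp-commands")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, with the output saved next to each core file for attaching to the bug report.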
