Unable to start

Marco Shaw

Sep 24, 2019, 10:31:22 AM
to riak-...@googlegroups.com
**RIAK NOVICE** (One could say "accidental" RIAK administrator.)

I'm unable to restart RIAK on a server after recent Ubuntu 16 patching.  This is 1/10 nodes (2 "clusters"), so I'm not concerned it's a patching issue.

I have AAE enabled.  I tried to clear /etc/riak/data/anti_entropy twice, but no luck restarting the process.

Googling around, I came across:

Just for fun, I tried "riak attach", but it seems the node has to be running for that:
root@[REMOVED]:/var/log/riak# riak attach
Node is not running!
root@[REMOVED]:/var/log/riak#

Is that a "supervisor" function? With the node down, I can't even tell whether I could try this on the index number found below (yes, I see the above link "warns" that AAE should be able to recover automagically).
riak_kv_vnode:repair(»Partition ID«)

What appears relevant to me:
console.log:
2019-09-24 04:59:07.635 [error] <0.880.0>@riak_kv_vnode:init:442 Failed to start riak_kv_eleveldb_backend backend for index 296867520082839655260123481645494988367611297792 error: {db_open,"Corruption: truncated record at end of file"}
2019-09-24 04:59:07.635 [notice] <0.880.0>@riak:stop:43 "backend module failed to start."
2019-09-24 04:59:07.635 [error] <0.880.0> gen_fsm <0.880.0> in state started terminated with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"Corruption: truncated record at end of file"}}}, undefined) line 1071
2019-09-24 04:59:07.635 [error] <0.880.0> CRASH REPORT Process <0.880.0> with 0 neighbours exited with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"Corruption: truncated record at end of file"}}}, undefined) line 1071 in gen_fsm:terminate/7 line 600
2019-09-24 04:59:07.637 [error] <0.241.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.880.0> exit with reason no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"Corruption: truncated record at end of file"}}}, undefined) line 1071 in context child_terminated
2019-09-24 04:59:07.640 [error] <0.238.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.265.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"Corruption: truncated record at end of file"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1071}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,907}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_fsm,sync_send_event,[<0.880.0>,wait_for_init,infinity]}} in context child_terminated

crash.log:
2019-09-24 04:55:37 =ERROR REPORT====
** State machine <0.843.0> terminating
** Last event in was timeout
** When State == started
**      Data  == {state,296867520082839655260123481645494988367611297792,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}
** Reason for termination =
** {function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"Corruption: truncated record at end of file"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1071}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,907}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
2019-09-24 04:55:37 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode:init/1
    pid: <0.843.0>
    registered_name: []
    exception exit: {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"Corruption: truncated record at end of file"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1071}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,907}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,600}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
    ancestors: [riak_core_vnode_sup,riak_core_sup,<0.228.0>]
    messages: [{'$gen_sync_event',{<0.841.0>,#Ref<0.0.0.2863>},wait_for_init}]
    links: [<0.232.0>]
    dictionary: [{random_seed,{26318,29506,29332}}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 8091
  neighbours:

Martin Sumner

Sep 24, 2019, 11:01:41 AM
to Marco Shaw, riak-users
Marco,

Looks like you have a corrupted leveldb database for the partition 296867520082839655260123481645494988367611297792. https://docs.riak.com/riak/kv/2.2.3/using/repair-recovery/repairs/#repairing-leveldb tells you how to directly run repair on the leveldb backend.

Note though that others have in the past suggested a more direct approach where you already have resilience - http://lists.basho.com/pipermail/riak-users_lists.basho.com/2019-September/039372.html

Personally I would try and repair leveldb first which can be quicker, but it may not always work.  Bryan's suggested approach of just deleting the vnode should work in all cases, but you have higher risks of inconsistent results during the recovery process.
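
For anyone following along, here is a rough Python sketch (not part of the Riak tooling) that pulls the failing partition index out of console.log lines like the ones quoted above and builds the `eleveldb:repair` expression that the "Repairing leveldb" section of the linked docs runs from an Erlang shell while the node is stopped. The helper names and the leveldb data directory passed in are assumptions made for illustration; check the data root configured on your own node before running anything.

```python
import re

# Matches the riak_kv_vnode init failure lines shown in console.log above.
FAILED_BACKEND = re.compile(
    r'Failed to start (?P<backend>\w+) backend for index (?P<index>\d+) '
    r'error: \{db_open,"(?P<reason>[^"]*)"\}'
)


def corrupted_partitions(log_lines):
    """Return {partition index: reason} for each backend that failed to open."""
    found = {}
    for line in log_lines:
        match = FAILED_BACKEND.search(line)
        if match:
            found[match.group("index")] = match.group("reason")
    return found


def repair_call(index, leveldb_root):
    """Build the Erlang expression to paste into an erl shell.

    leveldb_root is an assumption: it must be the leveldb data directory
    actually configured for the node, which this sketch cannot know.
    """
    return 'eleveldb:repair("{}/{}", []).'.format(leveldb_root, index)


if __name__ == "__main__":
    sample = (
        '2019-09-24 04:59:07.635 [error] <0.880.0>@riak_kv_vnode:init:442 '
        'Failed to start riak_kv_eleveldb_backend backend for index '
        '296867520082839655260123481645494988367611297792 error: '
        '{db_open,"Corruption: truncated record at end of file"}'
    )
    for index, reason in corrupted_partitions([sample]).items():
        print(index, "->", reason)
        # "/var/lib/riak/leveldb" is a placeholder path, not taken from this thread.
        print(repair_call(index, "/var/lib/riak/leveldb"))
```

The script only reads log text and prints a string; the repair itself still has to be run by hand as the docs describe, with the Riak node stopped.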

Regards

Martin




Marco Shaw

Sep 24, 2019, 11:06:56 AM
to Martin Sumner, riak-users
Thank you for responding so quickly.  Is this something I would run from another node?

root@[REMOVED]:/var/log/riak# riak-admin repair-2i 296867520082839655260123481645494988367611297792
Node is not running!
root@[REMOVED]:/var/log/riak# riak-admin repair-2i
Node is not running!

Martin Sumner

Sep 24, 2019, 11:33:46 AM
to Marco Shaw, riak-users

Sorry, the link was meant to be a link to the sub-headline "Repairing leveldb" and the stuff below it.

That repair-2i command is from the previous section (Repairing Secondary Indexes), and isn't relevant here.