Error: Corruption: truncated record at end of file


Marco Shaw

Dec 13, 2018, 9:50:21 AM
to riak-...@googlegroups.com
I'm getting this non-stop error on one node (out of 10):
2018-12-13 06:46:52.303 [error] <0.5427.713> CRASH REPORT Process <0.5427.713> with 0 neighbours exited with reason: no match of right hand value {error,{db_open,"Corruption: truncated record at end of file"}} in hashtree:new_segment_store/2 line 555 in gen_server:init_it/6 line 328
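
To gauge how "non-stop" the error really is, a minimal sketch like the one below could count db_open failures per hour from the Riak error log. The regular expression is an assumption based only on the single log line quoted above, and the function name is hypothetical; other log formats would need a different pattern.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this thread (an assumption):
# timestamp, [level], then a {db_open,"..."} tuple somewhere in the message.
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ '
    r'\[(?P<level>\w+)\] .*?'
    r'\{db_open,"(?P<reason>[^"]*)"\}'
)

def count_db_open_errors(lines):
    """Count db_open failures keyed by (hour, reason)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            hour = match.group("ts")[:13]  # "YYYY-MM-DD HH"
            counts[(hour, match.group("reason"))] += 1
    return counts

sample = [
    '2018-12-13 06:46:52.303 [error] <0.5427.713> CRASH REPORT Process '
    '<0.5427.713> with 0 neighbours exited with reason: no match of right '
    'hand value {error,{db_open,"Corruption: truncated record at end of '
    'file"}} in hashtree:new_segment_store/2 line 555 in '
    'gen_server:init_it/6 line 328',
]

for (hour, reason), n in count_db_open_errors(sample).items():
    print(hour, reason, n)
```

In practice the `sample` list would be replaced by the lines of the node's error log, opened as a regular text file.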

My CPU is relatively high, but the bigger issue is the beam.smp process takes more and more memory and within 72 hours, it crashes.

Would deleting the anti_entropy directory be a first step?


I've restarted the VM twice in the last week or so, but the memory just climbs until there's an OOM error and beam.smp crashes.

Martin Sumner

Dec 13, 2018, 10:46:56 AM
to marco...@gmail.com, riak-...@googlegroups.com
Marco,

First thing would be to use riak-admin top (http://docs.basho.com/riak/kv/2.2.3/using/admin/riak-admin/#top) sorted by memory to see where the memory is being used.  Advice would then vary depending on Riak version, Riak backend and possibly a host of other factors - so some more details of your setup would help. 

--
You received this message because you are subscribed to the Google Groups "riak-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to riak-users+...@googlegroups.com.
To post to this group, send email to riak-...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/riak-users/CAG5NM6RK0y07arWiJb%2BHzRWJtXhk%2BnkUJ_Y9DFZ%2BaNG0Qh8xpA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Marco Shaw

Dec 13, 2018, 11:06:37 AM
to riak-...@googlegroups.com
I'm an "accidental RIAK administrator", so bear with me. ;-)

What I see in top sorted by memory:
<6563.242.0>        application_master:start_     '-'          0    2546272          0 application_master:loop_it/4
**and quite a few of these:**
<6563.956.0>        riak_kv_index_hashtree:in     '-'        382    1574216          0 gen_server:loop/6

I know I have 10 nodes in total, but they are two groups of 5 (for failover purposes).  I don't quite know how this was all setup, but will try to dig deeper.

It would appear we have version 2.0.6-1 (yes, that appears quite old).

PS: How do you "kill" riak-admin top?  Ctrl-C + a?  Ctrl-C acts weird with this version, it seems: you just get one chance to break out properly.

Marco Shaw

Dec 13, 2018, 11:10:43 AM
to riak-...@googlegroups.com
There's 22 of these before the memory size drops off:
<6563.*.0>        riak_kv_index_hashtree:in     '-'        382    1574216          0 gen_server:loop/6

Martin Sumner

Dec 13, 2018, 11:39:28 AM
to marco...@gmail.com, riak-...@googlegroups.com
ctrl + c followed by `a` to abort the top.

How much memory do you have on each node?  There should be a header like this as well to the top output, which would be useful to see.  

 Load:  cpu         0               Memory:  total      112021    binary       1093
        procs    1053                        processes   63016    code        17363
        runq        0                        atom          735    ets          7784

...

The processes don't appear to be using much memory.  But if this is growing over time, it might be worth sending the output that you see from top later in the day so we can see where it is growing.

One other check is - do you use the solr search facility (yokozuna)?

Marco Shaw

Dec 13, 2018, 11:57:19 AM
to riak-...@googlegroups.com
Each node has 20GB. 

riak-admin top:

 'riak@xxxx'                                                       16:55:36
 Load:  cpu        45               Memory:  total      269050    binary      37704
        procs    1483                        processes  123804    code        22408
        runq        1                        atom          871    ets          3630

UNIX top command:
 1406 riak      20   0 10.663g 8.773g 486476 S 134.9 44.7   2272:34 beam.smp
  206 root      20   0       0      0      0 S   3.0  0.0  46:46.83 jbd2/dm-0-8
 1683 riak      20   0    4340   1464   1376 S   1.7  0.0  45:14.58 memsup

My other nodes are roughly 4.5GB memory used, while this node started at roughly 1.5GB yesterday after a restart and 24 hours later is almost 10x.

I see the free value slowly creeping down:

root@xxxx:~# free
             total       used       free     shared    buffers     cached
Mem:      20561624   14311016    6250608        480     340832    4794668
-/+ buffers/cache:    9175516   11386108
Swap:      2093052          0    2093052

I don't know anything about solr/yokozuna, but will search.

Martin Sumner

Dec 13, 2018, 12:10:40 PM
to marco...@gmail.com, riak-...@googlegroups.com
I might be misinterpreting this, but my understanding is that the memory reading in the riak-admin top output is in kilobytes (http://erlang.org/doc/apps/observer/etop_ug.html).  So although unix top is saying riak is taking up 10GB, the erlang top is saying only 270MB.  This is odd.
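
The mismatch can be made concrete with a small sketch that puts both readings into the same unit. It assumes, as suggested above, that the riak-admin top total is in kilobytes and that the "g" suffix in UNIX top means GiB; the figures are the ones posted earlier in this thread, and the helper names are hypothetical.

```python
KIB_PER_MIB = 1024
MIB_PER_GIB = 1024

def kib_to_mib(kib):
    """Convert kibibytes to mebibytes."""
    return kib / KIB_PER_MIB

def gib_to_mib(gib):
    """Convert gibibytes to mebibytes."""
    return gib * MIB_PER_GIB

etop_total_kib = 269050   # "Memory: total" from riak-admin top (assumed KiB)
beam_res_gib = 8.773      # RES column for beam.smp from UNIX top

vm_mib = kib_to_mib(etop_total_kib)
os_mib = gib_to_mib(beam_res_gib)

print(f"Erlang VM accounts for about {vm_mib:.0f} MiB")
print(f"The OS reports about {os_mib:.0f} MiB resident for beam.smp")
print(f"Not visible to the Erlang VM: about {os_mib - vm_mib:.0f} MiB")
```

If the unit assumption holds, nearly all of the resident memory sits outside what the Erlang VM itself tracks, which is why the backend question matters.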

I asked before about what backend you're using.  The only thing I can think of is that you're using the leveldb backend, and the memory allocated to leveldb is counting against the beam in unix top, but not being recorded in riak-admin top.  I don't know that this is the behaviour.  Other than that, the only other explanation is that there's more than one riak running, and you're not running riak-admin top against the right one.  Or I've completely misunderstood the stats output.

So what is your backend?  If it is leveldb, what is the `leveldb.maximum_memory.percent` setting in your riak.conf file?  What is your ring size?  Roughly how many keys and objects do you have in your store?  What is the output of `riak-admin status`?

Marco Shaw

Dec 13, 2018, 12:26:03 PM
to riak-...@googlegroups.com

This answers most of your questions (except keys and objects in the store, which I'll look up what that means/how to get it).

I believe I've removed everything "sensitive" so you'll see "nodeX" in places you might not expect.

leveldb.maximum_memory.percent = 70
ring_size = 128
storage_backend = leveldb

This means only *one RIAK* is running?  The setup here is that the 5 nodes are actually separate Ubuntu VMs.

root@node1:~# ps -ef|grep beam
riak      1406  1403 99 Dec12 pts/4    1-14:29:05 /usr/lib/riak/erts-5.10.3/bin/beam.smp -scl false -sfwi 500 -P 256000 -e 8192 -Q 65536 -A 64 -K true -W w -zdbbl 32768 -- -root /usr/lib/riak -progname riak -- -home /var/lib/riak -- -boot /usr/lib/riak/releases/2.0.6/riak -config /var/lib/riak/generated.configs/app.2018.12.12.05.23.09.config -setcookie riak -name riak@node1 -smp enable -vm_args /var/lib/riak/generated.configs/vm.2018.12.12.05.23.09.args -pa /usr/lib/riak/lib/basho-patches -- console


root@node1:~# riak-admin status
1-minute stats for 'riak@node1'
-------------------------------------------
connected_nodes : ['riak@node3','riak@node2',
                   'riak@node5','riak@node4']
consistent_get_objsize_100 : 0
consistent_get_objsize_95 : 0
consistent_get_objsize_99 : 0
consistent_get_objsize_mean : 0
consistent_get_objsize_median : 0
consistent_get_time_100 : 0
consistent_get_time_95 : 0
consistent_get_time_99 : 0
consistent_get_time_mean : 0
consistent_get_time_median : 0
consistent_gets : 0
consistent_gets_total : 0
consistent_put_objsize_100 : 0
consistent_put_objsize_95 : 0
consistent_put_objsize_99 : 0
consistent_put_objsize_mean : 0
consistent_put_objsize_median : 0
consistent_put_time_100 : 0
consistent_put_time_95 : 0
consistent_put_time_99 : 0
consistent_put_time_mean : 0
consistent_put_time_median : 0
consistent_puts : 0
consistent_puts_total : 0
converge_delay_last : 0
converge_delay_max : 0
converge_delay_mean : 0
converge_delay_min : 0
coord_redirs_total : 45039
counter_actor_counts_100 : 0
counter_actor_counts_95 : 0
counter_actor_counts_99 : 0
counter_actor_counts_mean : 0
counter_actor_counts_median : 0
cpu_avg1 : 410
cpu_avg15 : 581
cpu_avg5 : 525
cpu_nprocs : 299
dropped_vnode_requests_total : 0
executing_mappers : 0
gossip_received : 8
handoff_timeouts : 0
ignored_gossip_total : 0
index_fsm_active : 0
index_fsm_create : 0
index_fsm_create_error : 0
late_put_fsm_coordinator_ack : 0
leveldb_read_block_error : 0
list_fsm_active : 0
list_fsm_create : 0
list_fsm_create_error : 0
list_fsm_create_error_total : 0
list_fsm_create_total : 0
map_actor_counts_100 : 0
map_actor_counts_95 : 0
map_actor_counts_99 : 0
map_actor_counts_mean : 0
map_actor_counts_median : 0
mem_allocated : 14693687296
mem_total : 21055102976
memory_atom : 891825
memory_atom_used : 884493
memory_binary : 32288440
memory_code : 22945856
memory_ets : 3717360
memory_processes : 125946400
memory_processes_used : 124813656
memory_system : 142407856
memory_total : 268354256
node_get_fsm_active : 3
node_get_fsm_active_60s : 99
node_get_fsm_counter_objsize_100 : 0
node_get_fsm_counter_objsize_95 : 0
node_get_fsm_counter_objsize_99 : 0
node_get_fsm_counter_objsize_mean : 0
node_get_fsm_counter_objsize_median : 0
node_get_fsm_counter_siblings_100 : 0
node_get_fsm_counter_siblings_95 : 0
node_get_fsm_counter_siblings_99 : 0
node_get_fsm_counter_siblings_mean : 0
node_get_fsm_counter_siblings_median : 0
node_get_fsm_counter_time_100 : 0
node_get_fsm_counter_time_95 : 0
node_get_fsm_counter_time_99 : 0
node_get_fsm_counter_time_mean : 0
node_get_fsm_counter_time_median : 0
node_get_fsm_errors : 0
node_get_fsm_errors_total : 0
node_get_fsm_in_rate : 0
node_get_fsm_map_objsize_100 : 0
node_get_fsm_map_objsize_95 : 0
node_get_fsm_map_objsize_99 : 0
node_get_fsm_map_objsize_mean : 0
node_get_fsm_map_objsize_median : 0
node_get_fsm_map_siblings_100 : 0
node_get_fsm_map_siblings_95 : 0
node_get_fsm_map_siblings_99 : 0
node_get_fsm_map_siblings_mean : 0
node_get_fsm_map_siblings_median : 0
node_get_fsm_map_time_100 : 0
node_get_fsm_map_time_95 : 0
node_get_fsm_map_time_99 : 0
node_get_fsm_map_time_mean : 0
node_get_fsm_map_time_median : 0
node_get_fsm_objsize_100 : 7926
node_get_fsm_objsize_95 : 5669
node_get_fsm_objsize_99 : 7904
node_get_fsm_objsize_mean : 4152
node_get_fsm_objsize_median : 5031
node_get_fsm_out_rate : 0
node_get_fsm_rejected : 0
node_get_fsm_rejected_60s : 0
node_get_fsm_rejected_total : 0
node_get_fsm_set_objsize_100 : 0
node_get_fsm_set_objsize_95 : 0
node_get_fsm_set_objsize_99 : 0
node_get_fsm_set_objsize_mean : 0
node_get_fsm_set_objsize_median : 0
node_get_fsm_set_siblings_100 : 0
node_get_fsm_set_siblings_95 : 0
node_get_fsm_set_siblings_99 : 0
node_get_fsm_set_siblings_mean : 0
node_get_fsm_set_siblings_median : 0
node_get_fsm_set_time_100 : 0
node_get_fsm_set_time_95 : 0
node_get_fsm_set_time_99 : 0
node_get_fsm_set_time_mean : 0
node_get_fsm_set_time_median : 0
node_get_fsm_siblings_100 : 1
node_get_fsm_siblings_95 : 1
node_get_fsm_siblings_99 : 1
node_get_fsm_siblings_mean : 1
node_get_fsm_siblings_median : 1
node_get_fsm_time_100 : 8913
node_get_fsm_time_95 : 3331
node_get_fsm_time_99 : 8012
node_get_fsm_time_mean : 1837
node_get_fsm_time_median : 1558
node_gets : 102
node_gets_counter : 0
node_gets_counter_total : 0
node_gets_map : 0
node_gets_map_total : 0
node_gets_set : 0
node_gets_set_total : 0
node_gets_total : 180762
node_put_fsm_active : 2
node_put_fsm_active_60s : 96
node_put_fsm_counter_time_100 : 0
node_put_fsm_counter_time_95 : 0
node_put_fsm_counter_time_99 : 0
node_put_fsm_counter_time_mean : 0
node_put_fsm_counter_time_median : 0
node_put_fsm_in_rate : 3
node_put_fsm_map_time_100 : 0
node_put_fsm_map_time_95 : 0
node_put_fsm_map_time_99 : 0
node_put_fsm_map_time_mean : 0
node_put_fsm_map_time_median : 0
node_put_fsm_out_rate : 3
node_put_fsm_rejected : 0
node_put_fsm_rejected_60s : 0
node_put_fsm_rejected_total : 0
node_put_fsm_set_time_100 : 0
node_put_fsm_set_time_95 : 0
node_put_fsm_set_time_99 : 0
node_put_fsm_set_time_mean : 0
node_put_fsm_set_time_median : 0
node_put_fsm_time_100 : 39176
node_put_fsm_time_95 : 20226
node_put_fsm_time_99 : 34342
node_put_fsm_time_mean : 8199
node_put_fsm_time_median : 4372
node_puts : 62
node_puts_counter : 0
node_puts_counter_total : 0
node_puts_map : 0
node_puts_map_total : 0
node_puts_set : 0
node_puts_set_total : 0
node_puts_total : 112012
nodename : 'riak@node1'
object_counter_merge : 0
object_counter_merge_time_100 : 0
object_counter_merge_time_95 : 0
object_counter_merge_time_99 : 0
object_counter_merge_time_mean : 0
object_counter_merge_time_median : 0
object_counter_merge_total : 0
object_map_merge : 0
object_map_merge_time_100 : 0
object_map_merge_time_95 : 0
object_map_merge_time_99 : 0
object_map_merge_time_mean : 0
object_map_merge_time_median : 0
object_map_merge_total : 0
object_merge : 1
object_merge_time_100 : 31
object_merge_time_95 : 31
object_merge_time_99 : 31
object_merge_time_mean : 31
object_merge_time_median : 31
object_merge_total : 17251
object_set_merge : 0
object_set_merge_time_100 : 0
object_set_merge_time_95 : 0
object_set_merge_time_99 : 0
object_set_merge_time_mean : 0
object_set_merge_time_median : 0
object_set_merge_total : 0
pbc_active : 27
pbc_connects : 5
pbc_connects_total : 9778
pipeline_active : 0
pipeline_create_count : 0
pipeline_create_error_count : 0
pipeline_create_error_one : 0
pipeline_create_one : 0
postcommit_fail : 0
precommit_fail : 0
read_repairs : 0
read_repairs_counter : 0
read_repairs_counter_total : 0
read_repairs_fallback_notfound_count : undefined
read_repairs_fallback_notfound_one : undefined
read_repairs_fallback_outofdate_count : undefined
read_repairs_fallback_outofdate_one : undefined
read_repairs_map : 0
read_repairs_map_total : 0
read_repairs_primary_notfound_count : undefined
read_repairs_primary_notfound_one : undefined
read_repairs_primary_outofdate_count : 2420
read_repairs_primary_outofdate_one : 0
read_repairs_set : 0
read_repairs_set_total : 0
read_repairs_total : 1429
rebalance_delay_last : 0
rebalance_delay_max : 0
rebalance_delay_mean : 0
rebalance_delay_min : 0
rejected_handoffs : 2
riak_kv_vnodeq_max : 6446
riak_kv_vnodeq_mean : 257.84
riak_kv_vnodeq_median : 0
riak_kv_vnodeq_min : 0
riak_kv_vnodeq_total : 6446
riak_kv_vnodes_running : 25
riak_pipe_vnodeq_max : 0
riak_pipe_vnodeq_mean : 0.0
riak_pipe_vnodeq_median : 0
riak_pipe_vnodeq_min : 0
riak_pipe_vnodeq_total : 0
riak_pipe_vnodes_running : 25
ring_creation_size : 128
ring_members : ['riak@node3','riak@node4',
                'riak@node5','riak@node1','riak@node2']
ring_num_partitions : 128
ring_ownership : <<"[{'riak@node3',26},\n {'riak@node4',26},\n {'riak@node5',26},\n {'riak@node1',25},\n {'riak@node2',25}]">>
rings_reconciled : 0
rings_reconciled_total : 31
set_actor_counts_100 : 0
set_actor_counts_95 : 0
set_actor_counts_99 : 0
set_actor_counts_mean : 0
set_actor_counts_median : 0
skipped_read_repairs : 0
skipped_read_repairs_total : 0
storage_backend : riak_kv_eleveldb_backend
sys_driver_version : <<"2.2">>
sys_global_heaps_size : deprecated
sys_heap_type : private
sys_logical_processors : 2
sys_monitor_count : 475
sys_otp_release : <<"R16B02_basho8">>
sys_port_count : 70
sys_process_count : 1481
sys_smp_support : true
sys_system_architecture : <<"x86_64-unknown-linux-gnu">>
sys_system_version : <<"Erlang R16B02_basho8 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:64] [kernel-poll:true] [frame-pointer]">>
sys_thread_pool_size : 64
sys_threads_enabled : true
sys_wordsize : 8
vnode_counter_update : 0
vnode_counter_update_time_100 : 0
vnode_counter_update_time_95 : 0
vnode_counter_update_time_99 : 0
vnode_counter_update_time_mean : 0
vnode_counter_update_time_median : 0
vnode_counter_update_total : 0
vnode_get_fsm_time_100 : 9402
vnode_get_fsm_time_95 : 1279
vnode_get_fsm_time_99 : 6392
vnode_get_fsm_time_mean : 429
vnode_get_fsm_time_median : 259
vnode_gets : 351
vnode_gets_total : 589697
vnode_index_deletes : 0
vnode_index_deletes_postings : 297
vnode_index_deletes_postings_total : 454436
vnode_index_deletes_total : 0
vnode_index_reads : 37
vnode_index_reads_total : 82409
vnode_index_refreshes : 0
vnode_index_refreshes_total : 0
vnode_index_writes : 432
vnode_index_writes_postings : 287
vnode_index_writes_postings_total : 454619
vnode_index_writes_total : 701069
vnode_map_update : 0
vnode_map_update_time_100 : 0
vnode_map_update_time_95 : 0
vnode_map_update_time_99 : 0
vnode_map_update_time_mean : 0
vnode_map_update_time_median : 0
vnode_map_update_total : 0
vnode_put_fsm_time_100 : 13392
vnode_put_fsm_time_95 : 8516
vnode_put_fsm_time_99 : 11967
vnode_put_fsm_time_mean : 3097
vnode_put_fsm_time_median : 1539
vnode_puts : 437
vnode_puts_total : 704303
vnode_set_update : 0
vnode_set_update_time_100 : 0
vnode_set_update_time_95 : 0
vnode_set_update_time_99 : 0
vnode_set_update_time_mean : 0
vnode_set_update_time_median : 0
vnode_set_update_total : 0
disk : [{"/dev",10266840,1},
        {"/run",2056164,1},
        {"/",85277128,16},
        {"/sys/fs/cgroup",4,0},
        {"/run/lock",5120,0},
        {"/run/shm",10280812,0},
        {"/run/user",102400,0},
        {"/boot",240972,67}]
riak_auth_mods_version : <<"2.0.1-0-g9ae39fe">>
erlydtl_version : <<"0.7.0">>
riak_control_version : <<"2.0.2-0-g925e075">>
cluster_info_version : <<"2.0.3-0-g76c73fc">>
riak_jmx_version : <<"2.0.3-0-gc676607">>
riak_snmp_version : <<"2.0.2-0-g6ad5b4c">>
riak_repl_version : <<"2.0.1p1-0-gd98efc8">>
ranch_version : <<"0.4.0-p1">>
ebloom_version : <<"2.0.0">>
yokozuna_version : <<"2.0.3-0-g1894727">>
ibrowse_version : <<"4.0.1">>
riak_search_version : <<"2.0.3-0-g4414a6c">>
merge_index_version : <<"2.0.0-0-gb701dde">>
riak_kv_version : <<"2.0.3-0-gec9b41e">>
riak_api_version : <<"2.0.2p1-0-g1b694df">>
riak_pb_version : <<"2.0.0.17-0-g818e787">>
protobuffs_version : <<"0.8.1p5-0-gf88fc3c">>
riak_dt_version : <<"2.0.0p1-0-g5dd5307">>
sidejob_version : <<"2.0.0-0-gc5aabba">>
riak_pipe_version : <<"2.0.1-0-gc8fc8da">>
riak_core_version : <<"2.0.3-0-gcbca617">>
poolboy_version : <<"0.8.1p2-0-g84d836a">>
pbkdf2_version : <<"2.0.0-0-g7076584">>
eleveldb_version : <<"2.0.6-0-g8ddfc6a">>
clique_version : <<"0.2.5-0-g3af4db8">>
bitcask_version : <<"1.7.2">>
basho_stats_version : <<"1.0.3">>
webmachine_version : <<"1.10.6-0-g50c790d">>
mochiweb_version : <<"1.5.1p7">>
inets_version : <<"5.9.6">>
xmerl_version : <<"1.3.4">>
erlang_js_version : <<"1.3.0-0-g07467d8">>
mnesia_version : <<"4.10">>
snmp_version : <<"4.24.2">>
runtime_tools_version : <<"1.8.12">>
os_mon_version : <<"2.2.13">>
riak_sysmon_version : <<"2.0.0">>
exometer_core_version : <<"1.0.0-basho2-0-gb47a5d6">>
ssl_version : <<"5.3.1">>
public_key_version : <<"0.20">>
crypto_version : <<"3.1">>
asn1_version : <<"2.0.3">>
sasl_version : <<"2.3.3">>
lager_version : <<"2.0.3">>
goldrush_version : <<"0.1.6">>
compiler_version : <<"4.9.3">>
syntax_tools_version : <<"1.6.11">>
stdlib_version : <<"1.19.3">>
kernel_version : <<"2.16.3">>

Marco Shaw

Dec 13, 2018, 12:37:35 PM
to riak-...@googlegroups.com
Does this potentially mean anything?

One particular index appears in the first two sections with a "--", but not at all in the "keys repaired" section.

All of the other indexes seem to be present in each of the 3 sections and have various values (but not "--").

root@node1:/etc/riak/data# riak-admin aae-status

================================== Exchanges ==================================
Index                                              Last (ago)    All (ago)
-------------------------------------------------------------------------------
776422744832042175295707567380525354192214163456   --            --

================================ Entropy Trees ================================
Index                                              Built (ago)
-------------------------------------------------------------------------------
776422744832042175295707567380525354192214163456   --

================================ Keys Repaired ================================
Index                                                Last      Mean      Max
-------------------------------------------------------------------------------
***Index number above does not show here***


/etc/riak/data/anti_entropy:

The *3456 directory appears to have "old files" compared to the other directories with .sst files. Sample:

776422744832042175295707567380525354192214163456/sst_0:
total 13120
-rw-r--r-- 1 riak riak 3948045 Nov 28 07:30 000061.sst
-rw-r--r-- 1 riak riak 4734785 Nov 28 16:18 000063.sst
-rw-r--r-- 1 riak riak 4751247 Nov 29 01:24 000065.sst

776422744832042175295707567380525354192214163456/sst_1:
total 12968
-rw-r--r-- 1 riak riak 6718394 Nov 26 01:38 000050.sst
-rw-r--r-- 1 riak riak 6556382 Nov 27 18:35 000059.sst
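
Rather than eyeballing directory listings, a minimal sketch like the following could flag partition directories whose newest file is older than some threshold. The function names and the age threshold are hypothetical, and it only assumes the layout shown above: one subdirectory per partition index under anti_entropy.

```python
import os
import time

SECONDS_PER_DAY = 86400

def newest_mtime(root):
    """Return the latest modification time of any file under root, or None."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mtime = os.path.getmtime(os.path.join(dirpath, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest

def stale_partitions(anti_entropy_dir, max_age_days, now=None):
    """List partition directories with no file modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for entry in sorted(os.listdir(anti_entropy_dir)):
        path = os.path.join(anti_entropy_dir, entry)
        if not os.path.isdir(path):
            continue
        newest = newest_mtime(path)
        # An empty partition directory is reported as stale too.
        if newest is None or newest < cutoff:
            stale.append(entry)
    return stale

# Example call (path taken from the listing above, threshold is a guess):
# print(stale_partitions("/etc/riak/data/anti_entropy", max_age_days=7))
```

A directory that shows up here is only a candidate; files can legitimately be old if a tree has not been rebuilt recently.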


Martin Sumner

Dec 13, 2018, 1:03:13 PM
to marco...@gmail.com, riak-...@googlegroups.com
leveldb is configured by default to try to take 70% of the memory on the machine.  So as long as you have more data on disk than 70% of your memory, you will see memory usage grow up to 70%.  So the fact that memory is growing in that direction shouldn't in itself be cause for alarm.
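
As a rough worked example of that 70% figure, the sketch below applies the posted `leveldb.maximum_memory.percent` to the `mem_total` value from the riak-admin status output and compares it with the resident size UNIX top showed for beam.smp. It treats the percentage as a simple fraction of total memory, which is an assumption; how leveldb actually divides and enforces the budget is not covered here, and the helper name is hypothetical.

```python
GIB = 1024 ** 3

def leveldb_budget_bytes(mem_total_bytes, percent):
    """Approximate memory target for leveldb, as whole bytes."""
    return mem_total_bytes * percent // 100

mem_total = 21055102976        # mem_total from riak-admin status, in bytes
percent = 70                   # leveldb.maximum_memory.percent from riak.conf
beam_res_bytes = 8.773 * GIB   # RES for beam.smp from UNIX top (assumed GiB)

budget = leveldb_budget_bytes(mem_total, percent)

print(f"leveldb budget:     {budget / GIB:.2f} GiB")
print(f"beam.smp resident:  {beam_res_bytes / GIB:.2f} GiB")
print(f"still below budget: {beam_res_bytes < budget}")
```

On these numbers the process had not yet reached the configured target when the snapshot was taken, so continued growth after that point would be expected rather than surprising.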

If there is another process on the machine (other than riak) taking a lot of memory, then obviously it would be wrong for leveldb to try and take 70% - so you need to check that out.  You can reduce the percentage of memory leveldb uses to a lower level in this case. This will impact performance, but any excess memory will not be wasted; it should be used by the pagecache.

If you have nothing else taking memory on the machine, and it still runs out of memory ... then clearly there is something wrong with the algorithm leveldb is using to allocate memory.

Are you actually seeing out of memory errors in the riak logs?  As I say, Riak increasingly using more memory (up to about 70%) is normal in the configuration, so that might just be a red herring.

The actual error you reported initially was a low-level issue in leveldb - but this may just be genuine file corruption, in which case wiping the AAE partition and waiting for it to rebuild would probably do the job.  Until it has been wiped, and the rebuild time has come around again for it, you will see that partition in an odd state in the riak-admin aae-status output.
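
A cautious way to "wipe" a single AAE partition is to move its directory aside instead of deleting it outright. The sketch below does that with a dry-run default. It assumes the node has been stopped first and that the tree lives in a per-partition subdirectory of anti_entropy, as in the listing posted earlier; the function name and backup location are hypothetical, and the procedure should be checked against the Riak documentation for your version.

```python
import os
import shutil

def set_aside_partition(anti_entropy_dir, partition, backup_dir, dry_run=True):
    """Move one partition's AAE directory into backup_dir.

    Assumes Riak is stopped on this node (not checked here).
    Returns a tuple describing what was, or would be, done.
    """
    src = os.path.join(anti_entropy_dir, str(partition))
    dst = os.path.join(backup_dir, str(partition))
    if not os.path.isdir(src):
        raise FileNotFoundError(src)
    if os.path.exists(dst):
        raise FileExistsError(dst)
    if dry_run:
        return ("would move", src, dst)
    os.makedirs(backup_dir, exist_ok=True)
    shutil.move(src, dst)
    return ("moved", src, dst)

# Example call (paths are illustrative only):
# set_aside_partition("/etc/riak/data/anti_entropy",
#                     "776422744832042175295707567380525354192214163456",
#                     "/root/aae-backup", dry_run=True)
```

Keeping the old directory around costs disk space but leaves a way back if the corruption turns out to be elsewhere.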


Marco Shaw

Dec 13, 2018, 2:05:29 PM
to riak-...@googlegroups.com
I decided to blow away anti_entropy because:
1. This has been running for 15 months with no issues (other than another time I had high CPU on a node and deleted anti_entropy, but the error was different in the logs).
2. Only one node was behaving like this after I did some OS patches and restarted it recently.
3. All of the logs on that single server were posting the original error, over and over and over.
4. beam.smp was definitely pushing the memory to the limit (or at least the size of the memory for a single process, I didn't investigate that part):
kern.log:Dec 11 22:49:06 node1 kernel: [487657.715405] Out of memory: Kill process 1410 (beam.smp) score 984 or sacrifice child

Fingers crossed as it "rebuilds"...
