nodetool repair crashed


Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 26, 2019, 7:52:26 AM
to ScyllaDB users
We are running a test:
6 machines with 7 TB SSDs
and 2 TB of data on them.

We completed the procedure in



We replaced the .11 node with the .14 node using

replace_address_first_boot: 172.24.221.11

UN  172.24.221.10  1.8 TB     256          77.3%             
UN  172.24.221.12  1.53 TB    256          73.7%             
UN  172.24.221.13  1.87 TB    256          74.8%             
UN  172.24.221.14  8.82 GB    256          74.2%             

We killed the old node (UN  172.24.221.11).
Everything looked OK,

but nodetool repair crashed:

Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 15] rpc - client 172.24.221.11: fail to connect: Connection refused
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 19] repair - Checksum of range [-9167189631503328870, -9167077041512644608) on 172.24.221.11 failed: seastar::rpc::closed_error (connection is closed)
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 19] repair - Checksum or sync of partial range failed
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 19] rpc - client 172.24.221.11: fail to connect: Connection refused
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 23] repair - Checksum of range (-8792884207473498521, -8792827912478156390) on 172.24.221.11 failed: seastar::rpc::closed_error (connection is closed)
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 23] repair - Checksum or sync of partial range failed
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 23] rpc - client 172.24.221.11: fail to connect: Connection refused
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 23] repair - Checksum of range [-8788436902841470156, -8788324312850785894) on 172.24.221.11 failed: seastar::rpc::closed_error (connection is closed)
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 23] repair - Checksum or sync of partial range failed
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 23] rpc - client 172.24.221.11: fail to connect: Connection refused
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 39] repair - Checksum of range [-8786635462990521958, -8786522872999837696) on 172.24.221.11 failed: seastar::rpc::closed_error (connection is closed)
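
Since the log excerpt above repeats the same failure on several shards, it can help to tally the failed ranges per peer before reading further. The following is a minimal sketch, assuming the line layout is exactly the one quoted in this message; the function name `summarize_repair_failures` and the regular expression are illustrative and are not part of Scylla or nodetool.

```python
import re
from collections import Counter

# Matches lines of the form quoted above:
#   "[shard N] repair - Checksum of range <range> on <ip> failed: <reason>"
# The pattern is an assumption based only on the lines in this thread.
FAILED_RANGE = re.compile(
    r"\[shard (?P<shard>\d+)\] repair - Checksum of range "
    r"(?P<range>[\[(][^\])]*[\])]) on (?P<peer>\d+\.\d+\.\d+\.\d+) "
    r"failed: (?P<reason>.*)$"
)


def summarize_repair_failures(lines):
    """Count failed range checksums per (peer address, failure reason)."""
    counts = Counter()
    for line in lines:
        match = FAILED_RANGE.search(line)
        if match:
            counts[(match.group("peer"), match.group("reason").strip())] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 19] repair - Checksum of range "
        "[-9167189631503328870, -9167077041512644608) on 172.24.221.11 failed: "
        "seastar::rpc::closed_error (connection is closed)",
        "Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 19] rpc - client 172.24.221.11: "
        "fail to connect: Connection refused",
    ]
    for (peer, reason), n in summarize_repair_failures(sample).items():
        print(f"{peer}: {n} failed range(s), reason: {reason}")
```

On the excerpt in this message, every failed range points at 172.24.221.11, the address that was replaced.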



Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 26, 2019, 7:55:04 AM
to ScyllaDB users
We killed this node.

Why does the process keep connecting to it after a restart?
On connect we get:

2019/09/26 14:52:29 logger.go:27: Found invalid peer '[HostInfo connectAddress="172.24.221.11" peer="172.24.221.11" rpc_address="<nil>" broadcast_address="<nil>" preferred_ip="<nil>" connect_addr="172.24.221.11" connect_addr_source="connect_address" port=9042 data_centre="" rack="" host_id="00000000-0000-0000-0000-000000000000" version="v0.0.0" state=UP num_tokens=256]' Likely due to a gossip or snitch issue, this host will be ignored
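
To see at a glance what the client knows about this peer, the `key="value"` pairs in the log line above can be pulled apart and the unpopulated ones listed. This is only a reading aid, sketched under the assumption that the line layout is as quoted; it is not the driver's own validation logic, and the helper name `unpopulated_fields` is hypothetical.

```python
import re

# key="value" pairs as they appear in the quoted HostInfo line.
FIELD = re.compile(r'(\w+)="([^"]*)"')

# Values treated here as "not populated". This list is an assumption made
# for illustration, based on the placeholders visible in the quoted line.
PLACEHOLDERS = {"", "<nil>", "00000000-0000-0000-0000-000000000000", "v0.0.0"}


def unpopulated_fields(log_line):
    """Return the sorted names of quoted fields that hold a placeholder value."""
    fields = dict(FIELD.findall(log_line))
    return sorted(name for name, value in fields.items() if value in PLACEHOLDERS)


if __name__ == "__main__":
    line = (
        '[HostInfo connectAddress="172.24.221.11" peer="172.24.221.11" '
        'rpc_address="<nil>" broadcast_address="<nil>" preferred_ip="<nil>" '
        'connect_addr="172.24.221.11" connect_addr_source="connect_address" '
        'port=9042 data_centre="" rack="" '
        'host_id="00000000-0000-0000-0000-000000000000" version="v0.0.0" '
        'state=UP num_tokens=256]'
    )
    print(unpopulated_fields(line))
```

For the quoted line, only the address fields are filled in; the datacenter, rack, host ID and version are all placeholders, which may mean the address comes from the client's own configured host list rather than from cluster metadata.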


On Thursday, September 26, 2019 at 14:52:26 UTC+3, Vasily Popkov wrote:

Asias He

<asias@scylladb.com>
Sep 26, 2019, 8:03:46 AM
to ScyllaDB users
What is the IP address of plck-pixeldb-05? I assume it is another node. If you kill 172.24.221.11 while a repair that involves 172.24.221.11 is running, the repair is expected to fail.

By "repair crashed", I guess you mean repair failed.


 


--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/7ddf8a25-de4e-4c29-a72b-c08dccca447e%40googlegroups.com.


--
Asias

Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 26, 2019, 8:09:55 AM
to ScyllaDB users
.11 is pixeldb-02
.14 is pixeldb-05

Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 26, 2019, 8:13:39 AM
to ScyllaDB users
Please read the first message.


On Thursday, September 26, 2019 at 15:03:46 UTC+3, Asias He wrote:


IMG_20190926_143551.jpg

Asias He

<asias@scylladb.com>
Sep 26, 2019, 8:20:26 AM
to ScyllaDB users
On Thu, Sep 26, 2019 at 8:13 PM Vasily Popkov <p1dl...@gmail.com> wrote:
please read first message please.

Which message?
 


--
Asias

Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 26, 2019, 8:39:44 AM
to ScyllaDB users
https://groups.google.com/d/msg/scylladb-users/Yfse6ewtCcs/Ac6p9kMZCAAJ

On Thursday, September 26, 2019 at 15:20:26 UTC+3, Asias He wrote:


Asias He

<asias@scylladb.com>
Sep 26, 2019, 8:52:22 AM
to ScyllaDB users

Of course I read this message. Asking me to read it again does not help. It is not clear when you ran the repair: during the replace operation or after it was done. The original message also does not say on which node you ran the repair.



Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 26, 2019, 9:11:51 AM
to ScyllaDB users
I ran the repair on the new node,
as it says in

all step by step.


On Thursday, September 26, 2019 at 15:52:22 UTC+3, Asias He wrote:

Asias He

<asias@scylladb.com>
Sep 27, 2019, 3:03:16 AM
to ScyllaDB users
On Thu, Sep 26, 2019 at 9:11 PM Vasily Popkov <p1dl...@gmail.com> wrote:

That does not make sense, because after the replace operation is done, the new node will not see the old node.

```
Sep 26 14:42:56 plck-pixeldb-05 scylla: [shard 19] repair - Checksum of range [-9167189631503328870, -9167077041512644608) on 172.24.221.11 failed: seastar::rpc::closed_error (connection is closed)
```

The log suggests that plck-pixeldb-05 (the new node) tries to repair with 172.24.221.11 (the old node, which was replaced).

Please provide the output of

```
nodetool gossipinfo
nodetool ring
```
from all nodes.

If you run repair again on the new node, does it succeed?
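
The underlying question (does any node still list the replaced address?) can also be checked mechanically once the output is collected. Below is a minimal sketch that parses status-style rows in the layout already quoted in this thread (`UN  <address>  <load>  <tokens>  <owns>`); it does not parse `nodetool gossipinfo` or `nodetool ring` output, whose formats differ, and the names `parse_status` and `still_listed` are hypothetical helpers, not nodetool features.

```python
import re

# One row of the status table as quoted in this thread, for example:
#   "UN  172.24.221.14  8.82 GB    256          74.2%"
# The column layout is an assumption taken from those quoted rows.
ROW = re.compile(
    r"^(?P<state>[UD][NLJM])\s+(?P<address>\d+\.\d+\.\d+\.\d+)\s+"
    r"(?P<load>[\d.]+\s+\w+)\s+(?P<tokens>\d+)\s+(?P<owns>[\d.]+%)"
)


def parse_status(text):
    """Map each node address to the columns parsed from its status row."""
    nodes = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            nodes[match.group("address")] = match.groupdict()
    return nodes


def still_listed(text, replaced_address):
    """True if the replaced address still appears as a row in the status text."""
    return replaced_address in parse_status(text)


if __name__ == "__main__":
    status = """\
UN  172.24.221.10  1.8 TB     256          77.3%
UN  172.24.221.12  1.53 TB    256          73.7%
UN  172.24.221.13  1.87 TB    256          74.8%
UN  172.24.221.14  8.82 GB    256          74.2%
"""
    print(still_listed(status, "172.24.221.11"))
```

Running the same check against the output saved from every node would show whether only some nodes still remember 172.24.221.11.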

--
Asias

Vasily Popkov

<p1dl0.vp@gmail.com>
Sep 27, 2019, 7:08:31 AM
to ScyllaDB users


We cleaned the new node and re-ran the repair,

and the process is very slow.

On FoundationDB, repair is much faster.

After 1 day of repair we have:

--  Address        Load       Tokens    Owns (effective)  Host ID                               Rack
UN  172.24.221.10  2.61 TB    256          75.8%             b7a00907-6054-4ee1-83c7-a023f74a1894  rack1
UN  172.24.221.12  2.29 TB    256          71.8%             2cdfe372-0b3f-49a8-b16b-148efc29ebad  rack1
UN  172.24.221.13  2.69 TB    256          76.9%             fd34f5b4-6a09-4020-bd58-d5d720e74867  rack1
UN  172.24.221.14  1.79 TB    256          75.5%             4d083b8c-83ab-4003-84d2-0a6508df01be  rack1

The test writes 10k op/s.

The repair is still in progress
and is very slow.

How can we speed up the process?

We have a 10G network and 2 GB/s SSD throughput.
Why is the rebalance so slow?
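
To put "very slow" into numbers, here is a rough back-of-the-envelope calculation. It assumes that the whole 1.79 TB now on 172.24.221.14 arrived during the one day of repair, that "10g" means a 10 Gbit/s link, and that units are decimal; the reported load also includes the ongoing test writes, so this is only an estimate.

```python
def average_rate_mb_s(terabytes, seconds):
    """Average transfer rate in MB/s (decimal units: 1 TB = 1e12 B, 1 MB = 1e6 B)."""
    return terabytes * 1e12 / seconds / 1e6


def link_capacity_mb_s(gigabits_per_second):
    """Raw link capacity in MB/s, ignoring protocol overhead."""
    return gigabits_per_second * 1e9 / 8 / 1e6


if __name__ == "__main__":
    day = 24 * 60 * 60  # seconds in the assumed one-day repair window

    observed = average_rate_mb_s(1.79, day)   # load on the new node after ~1 day
    link = link_capacity_mb_s(10)             # assumed 10 Gbit/s network

    print(f"observed ~{observed:.1f} MB/s, link ~{link:.0f} MB/s, "
          f"utilisation {observed / link:.1%}")
```

Under these assumptions the new node filled at roughly 20.7 MB/s on average, against about 1250 MB/s of raw link capacity, so neither the network nor the disks look like the limiting factor.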




On Friday, September 27, 2019 at 10:03:16 UTC+3, Asias He wrote: