HyperDex coordinators unexpectedly drop out of the cluster without any operation

Hao Yuan

Mar 21, 2016, 6:50:46 AM

to hyperdex-discuss

Hi all,

Recently I have been experimenting with HyperDex and ran into problems with the coordinator cluster. I will describe the whole story and provide some detailed information. It is quite long, but please help me!

Background:

I want to set up a HyperDex cluster to support fast search. The cluster contains 3 coordinators and 5 daemons.

What blocks me:

The coordinator cluster often hangs because two nodes unexpectedly drop out of the cluster after running for some time without any put/get/search operations. The log files show something, but not enough for me to find a solution.

Machines:

All machines are separate virtual machines in the cloud.

Operating system: CentOS

Coordinators: 4 CPUs, 16 GB memory, 100 GB disk

Daemons: 4 CPUs, 16 GB memory, 1000 GB disk

All run the latest HyperDex version.

Data:

The original data file is about 2 GB, with 16 columns.

Steps:

Set up a cluster of 3 coordinators:

Coordinator 1:

hyperdex coordinator -d -l {IP1} -p 1985 -D {xxx/coord} -L {xxx/log}

Coordinator 2:

hyperdex coordinator -d --listen={IP2} --listen-port=1985 --connect={IP1} --connect-port=1985 -D {xxx/coord} -L {xxx/log}

Coordinator 3:

hyperdex coordinator -d --listen={IP3} --listen-port=1985 --connect={IP1} --connect-port=1985 -D {xxx/coord} -L {xxx/log}

Set up 5 daemons:

hyperdex daemon -d --listen=${nodeIP} --listen-port=2012 --coordinator={IP1} --coordinator-port=1985 --data={node/data} --log={node/data}

Question 1:

While setting up the cluster, I used the command:

hyperdex show-config -h {IP1} -p 1985

to check the state of the whole cluster. But it did not always work. Before I added the daemons to the cluster, it sometimes showed:

Cluster:0

Flag:0

Version:0

Sometimes it showed:

Cluster: correct ID

Flag:0

Version:1

The result seems random when I run the command repeatedly.

After I successfully added daemons to the cluster, the command worked well, like:

Cluster: correct ID

Version:8

Flag:0

Server 1

Server 2

Server 3

Server 4

Server 5

Can anyone help explain this to me?
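
To make the two kinds of reply easier to tell apart when running the command repeatedly, here is a minimal Python sketch. It assumes the real output consists of `Key:value` header lines as paraphrased above, and it treats a zero cluster ID or zero version as "not initialized yet"; both are assumptions drawn from this post, not documented behavior. The function names are hypothetical; only the `hyperdex show-config -h ... -p ...` invocation comes from the post.

```python
import subprocess
import time


def parse_show_config(text):
    """Collect 'Key:value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def looks_initialized(fields):
    # Assumption: "Cluster:0" or "Version:0" means the coordinator answered
    # before it had a real configuration to report.
    return fields.get("cluster", "0") != "0" and fields.get("version", "0") != "0"


def poll_until_initialized(host, port, attempts=10, delay=1.0):
    """Run show-config repeatedly; return the parsed fields or None on timeout."""
    for _ in range(attempts):
        result = subprocess.run(
            ["hyperdex", "show-config", "-h", host, "-p", str(port)],
            capture_output=True, text=True)
        fields = parse_show_config(result.stdout)
        if looks_initialized(fields):
            return fields
        time.sleep(delay)
    return None
```

Calling `poll_until_initialized("{IP1}", 1985)` with the real address substituted would at least show how often the empty reply comes back, which may help when reporting the behavior.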

Data importing:

I set up a separate server for data import and put/get/search operations.

The data was imported into the cluster successfully, and the put/get/search operations also worked.

The space was created as below:

hyperdex add-space -h IP1 -p 1985 <<EOF

space testSpace

key int id

attributes a1,a2,...a15

subspace a1,a2,a3

create 8 partitions

tolerate 2 failures

EOF
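
For anyone who wants to regenerate this description instead of typing fifteen attribute names, here is a small Python sketch. The helper names are hypothetical, and the attribute list simply expands the `a1,a2,...a15` shorthand above. The replica count follows the usual HyperDex rule that tolerating f failures keeps f + 1 copies of each partition; treat that as my understanding rather than something stated in this thread.

```python
def space_description(name, key, attributes, subspace, partitions, failures):
    """Build the text that is piped into `hyperdex add-space`."""
    lines = [
        "space %s" % name,
        "key %s" % key,
        "attributes %s" % ", ".join(attributes),
        "subspace %s" % ", ".join(subspace),
        "create %d partitions" % partitions,
        "tolerate %d failures" % failures,
    ]
    return "\n".join(lines) + "\n"


def replicas_needed(failures):
    # Assumption: tolerating f failures means f + 1 replicas per partition.
    return failures + 1


ATTRIBUTES = ["a%d" % i for i in range(1, 16)]  # a1 .. a15
DESCRIPTION = space_description(
    "testSpace", "int id", ATTRIBUTES, ["a1", "a2", "a3"], 8, 2)

if __name__ == "__main__":
    print(DESCRIPTION)
    print("replicas per partition:", replicas_needed(2))
```

With 5 daemons and 3 replicas per partition, the data side of the cluster should have room to lose two daemons; this says nothing about how many coordinators can be lost.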

What I ran into:

After I finished my operations and went home, the second and third coordinators unexpectedly dropped out of the cluster. The two machines also showed very high disk read activity, to the point that the systems halted. (It has happened twice, so I am writing to ask for help.)

When I ran put/get/search operations, I got some exceptions because the data types were not correct. I do not think that would cause the systems to halt; I mention it only to present the whole story.
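
The type exceptions can be caught on the client before anything is sent. Below is a minimal Python sketch that checks a record against the space defined above. It assumes the key `id` is an integer and that the untyped attributes `a1` to `a15` default to strings; the function name and the error messages are hypothetical, not part of any HyperDex API.

```python
ATTRIBUTES = ["a%d" % i for i in range(1, 16)]  # a1 .. a15


def check_record(key, attrs):
    """Return a list of type problems; an empty list means the record looks fine."""
    problems = []
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(key, bool) or not isinstance(key, int):
        problems.append("key must be int, got %s" % type(key).__name__)
    for name, value in attrs.items():
        if name not in ATTRIBUTES:
            problems.append("unknown attribute %r" % name)
        elif not isinstance(value, (str, bytes)):
            # Assumption: attributes declared without a type are strings.
            problems.append(
                "%s must be a string, got %s" % (name, type(value).__name__))
    return problems
```

Running this over each row of the import file before the put would separate bad input from genuine cluster problems.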

Question 2: What happened, and how can I solve it?

It seems that something is wrong with Replicant: the logs and the data on the coordinators suggest that Replicant is not working correctly.

I have prepared a separate analysis document with the details. Could anyone read it and give me some advice?

Has anyone met a similar issue when setting up a coordinator cluster?

I would appreciate any advice. Meanwhile, I will keep exploring what happened.

Thanks and Best Regards,

Hao

Detail analysis.doc

Hao Yuan

Mar 28, 2016, 8:20:05 AM

to hyperdex-discuss

Hi all,

I tried to reproduce this issue but failed. Now the coordinator cluster is stable. But I really did hit the problem twice. Anyway, I would like to continue my experiment. If anyone runs into the same instability, please tell me.

Best Regards,
Hao

Robert Escriva

Mar 28, 2016, 11:17:34 AM

to hyperdex...@googlegroups.com

I've improved the coordinator since the last release. Much of the
instability you describe has been addressed.

-Robert

Hao Yuan

Mar 29, 2016, 4:42:01 AM

to hyperdex-discuss

Hi Robert,

Thanks for your reply!

But unfortunately, the coordinator cluster went down again today without any operations. Can you say more about your fixes? Which component and which version? That would be very helpful.

Actually, I installed HyperDex from source following the instructions on your site (hyperleveldb-1.2.2, hyperdex-1.8.1). And I see that you added two new lines to the file hyperspace.cc on GitHub. I hope these two great lines can solve my problem :D.

BTW, from the logs I suspect the instability is caused by the sync between coordinators. I will download the latest version and try again. But I would appreciate it if you could give me more information.

Best Regards,

Hao

On Monday, March 28, 2016 at 11:17:34 PM UTC+8, Robert Escriva wrote:

Hao Yuan

Mar 29, 2016, 4:56:21 AM

to hyperdex-discuss

Hi Robert,

Can you give me some suggestions on how to recover the cluster?

Now one coordinator still works. One dropped out and its process stopped unexpectedly. The last one halted because of heavy I/O requests, and I had to restart it.

Best Regards,

Hao

On Monday, March 28, 2016 at 11:17:34 PM UTC+8, Robert Escriva wrote:

On Mon, Mar 28, 2016 at 05:20:04AM -0700, Hao Yuan wrote:

Hao Yuan

Apr 18, 2016, 7:43:40 AM

to hyperdex-discuss

Hi Robert,

Now the cluster is stable after switching to the master branch code from GitHub.

Thanks,

Hao

On Monday, March 28, 2016 at 11:17:34 PM UTC+8, Robert Escriva wrote:

On Mon, Mar 28, 2016 at 05:20:04AM -0700, Hao Yuan wrote:
