Background:
I want to set up a HyperDex cluster to support fast search. The cluster contains 3 coordinators and 5 daemons.
What blocks me:
The coordinator cluster often hangs because two nodes unexpectedly drop out of the cluster after running for some time without any put/get/search operations. The log files show something, but not enough for me to find a solution.
Machines:
All machines are separate virtual machines in the cloud.
Operating system: CentOS
Coordinators: 4 CPUs, 16 GB memory, 100 GB disk
Daemons: 4 CPUs, 16 GB memory, 1000 GB disk
All run the latest HyperDex version.
Data:
The original data file is about 2 GB, with 16 columns.
Steps:
Set up a cluster of 3 coordinators:
Coordinator 1:
hyperdex coordinator -d -l {IP1} -p 1985 -D {xxx/coord} -L {xxx/log}
Coordinator 2:
hyperdex coordinator -d --listen={IP2} --listen-port=1985 --connect={IP1} --connect-port=1985 -D {xxx/coord} -L {xxx/log}
Coordinator 3:
hyperdex coordinator -d --listen={IP3} --listen-port=1985 --connect={IP1} --connect-port=1985 -D {xxx/coord} -L {xxx/log}
Set up 5 daemons:
hyperdex daemon -d --listen=${nodeIP} --listen-port=2012 --coordinator={IP1} --coordinator-port=1985 --data={node/data} --log={node/data}
Question 1:
While setting up the cluster, I used the command:
hyperdex show-config -h {IP1} -p 1985
to check the state of the whole cluster. But it didn’t always work. Before I added the daemons to the cluster, it sometimes showed:
Cluster:0
Flag:0
Version:0
Sometimes it showed:
Cluster: correct ID
Flag:0
Version:1
The results seem random when I keep executing the command.
After I had successfully added any daemons to the cluster, the command worked well, for example:
Cluster: correct ID
Version:8
Flag:0
Server 1
Server 2
Server 3
Server 4
Server 5
Can anyone help explain this to me?
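To narrow this down, the same show-config command can be pointed at each coordinator in turn rather than only at {IP1}, to see whether the empty result depends on which coordinator answers. This is only a minimal sketch: the function name poll_coordinators is made up for illustration, and the only HyperDex invocation it uses is the show-config command shown above.

```shell
# Query every coordinator in turn with the show-config command from this post.
# poll_coordinators is a hypothetical helper name, not part of HyperDex.
# Usage: poll_coordinators PORT HOST [HOST...]
poll_coordinators() {
    port="$1"
    shift
    for host in "$@"; do
        echo "== $host:$port =="
        # Keep going when one coordinator is unreachable or returns an error.
        hyperdex show-config -h "$host" -p "$port" || echo "query failed for $host"
    done
}
```

It would be called as poll_coordinators 1985 {IP1} {IP2} {IP3}, repeated a few times, so the outputs of the three coordinators can be compared side by side.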
Data importing:
I set up a separate server to do the data import and the put/get/search operations.
The data was imported into the cluster successfully, and the put/get/search operations also worked.
The space was created like this:
hyperdex add-space -h IP1 -p 1985 <<EOF
space testSpace
key int id
attributes a1,a2,...a15
subspace a1,a2,a3
create 8 partitions
tolerate 2 failures
EOF
What I ran into:
After I finished my operations and went home, the second and third coordinators unexpectedly dropped out of the cluster. The two machines showed very high disk read activity, to the point that the systems halted. (It has happened twice, so I am writing to ask for help.)
When I did put/get/search operations, I got some exceptions because the data types weren’t correct. I don’t think that would cause the systems to halt; I mention it only to present the whole story.
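To confirm which process is responsible for the disk reads, the per-process I/O counters that Linux exposes in /proc/PID/io can be sampled. This is a minimal sketch, not anything HyperDex-specific: the helper names read_bytes_from and read_delta are made up for illustration, and PID stands for the process ID of the coordinator on the affected machine, which has to be looked up separately.

```shell
# Print the cumulative bytes a process has read from storage, given the
# path to its /proc/PID/io file (a standard Linux procfs file).
read_bytes_from() {
    awk '/^read_bytes:/ { print $2 }' "$1"
}

# Sample the counter twice and print the bytes read during the interval.
# Usage: read_delta IO_FILE SECONDS
read_delta() {
    before="$(read_bytes_from "$1")"
    sleep "$2"
    after="$(read_bytes_from "$1")"
    echo $((after - before))
}
```

For example, read_delta /proc/PID/io 10 prints how many bytes that process read in 10 seconds. Reading the io file of a process owned by another user usually requires root.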
Question 2: What happened, and how can I solve it?
It seems that something is wrong with Replicant: the logs and the data on the coordinators suggest that Replicant is not working correctly.
I have prepared a separate analysis document with the details. Could anyone read it and give me some advice?
Has anyone met a similar issue when setting up a coordinator cluster?
I would appreciate any advice. In the meantime I will keep exploring what happened.
Thanks and Best Regards,
Hao
Hi Robert,
Thanks for your reply!
But unfortunately, the coordinator cluster went down again today without any operations. Can you describe your fix in more detail? Which component and which version? That would be very helpful.
Actually, I installed HyperDex from source following the instructions on your site (hyperleveldb-1.2.2, hyperdex-1.8.1). I see that you added two new lines to the file hyperspace.cc on GitHub. I hope these two great lines solve my problem :D.
BTW, from the log I suspect the instability is caused by the sync between coordinators. I will download the latest version and try again, but I would appreciate any more information you can give.
Best Regards,
Hao
On Mon, Mar 28, 2016 at 05:20:04AM -0700, Hao Yuan wrote:
The cluster is now stable after I switched to the master branch code from GitHub.