Writing is very slow and multi-process writing gets blocked when the number of daemon servers grows


Hao Yuan

Apr 25, 2016, 6:18:32 AM4/25/16
to hyperdex-discuss
Hi Robert, hi all,

I have run into a serious problem doing put operations against a multi-node HyperDex cluster.

The writes are very slow, and multi-process writes get blocked. Once that happens, count, search, and other operations block as well. This is very bad for me. I have tried several times and it always happens.

The detailed steps:

All machines run the latest code from GitHub.

 

The file system is ext4, and I have applied the recommended tuning on every machine.

 

Cluster setup:

 

Coordinator:

 hyperdex coordinator -d -l 10.139.98.119 -p 1982 -D /alidata1/admin/hyperdex/coord/data/ -L /alidata1/admin/hyperdex/coord/log/


Daemons:

 hyperdex daemon -d -l 10.253.9.183 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/ 

 

Space creation script:

hyperdex add-space -h 10.139.98.119 -p 1982 << EOF
        space testSpace
        key int id
        attributes a1, a2, a3, a4, a5, a6, a7, a8
        subspace a1, a2, a3
        create 24 partitions
        tolerate 1 failures
EOF

 

Data generated by a single process:

python generateData.py 10.139.98.119 1982 testSpace 1 100000 &
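generateData.py is attached rather than inlined; for context, a minimal sketch of what such a loader might look like, assuming the HyperDex Python bindings (`hyperdex.client.Client`) and placeholder attribute values (the real payload in the attached script may differ):

```python
def load(client, space, lo, hi):
    """Synchronously put keys lo..hi into `space`; returns the number of puts issued."""
    count = 0
    for i in range(lo, hi + 1):
        # Eight string attributes a1..a8 with placeholder values (assumption).
        value = {'a%d' % k: 'val-%d' % i for k in range(1, 9)}
        client.put(space, i, value)
        count += 1
    return count

# Real usage (requires the HyperDex Python bindings):
#   from hyperdex.client import Client
#   load(Client('10.139.98.119', 1982), 'testSpace', 1, 100000)
```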

 

Report log:

total cost:

0:02:17.793744

total write cost:

0:02:14.848735

 

That is about 700 records/s with a single Python process.
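As a quick sanity check, the rate follows directly from the reported write time:

```python
# 100,000 puts in a total write cost of 0:02:14.848735
write_seconds = 2 * 60 + 14.848735
rate = 100000 / write_seconds
print(round(rate))  # ~742 records/s, in the "about 700/s" ballpark reported
```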

Data generated by 15 processes:

#!/bin/sh

for (( i=0 ; i<15; i++ ))
do
    echo $i
    min=$[$i*20000 + 1 + 200000]
    max=$[$min + 20000 - 1]
    python generateData.py 10.139.98.119 1982 testSpace $min $max &
done
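The loop hands each of the 15 processes a disjoint 20,000-key range; reproducing the arithmetic in Python confirms the ranges cover keys 200001..500000, i.e. 300,000 keys total:

```python
ranges = []
for i in range(15):
    lo = i * 20000 + 1 + 200000
    hi = lo + 20000 - 1
    ranges.append((lo, hi))

print(ranges[0])   # (200001, 220000)
print(ranges[-1])  # (480001, 500000)
total = sum(hi - lo + 1 for lo, hi in ranges)
print(total)       # 300000
```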

 

The 300,000 rows of data can be put into the cluster in about 30 seconds.

 

But when I add a new node to the cluster, or start a new cluster as below:

Cluster setup:

 

Coordinator:

 hyperdex coordinator -d -l 10.139.98.119 -p 1982 -D /alidata1/admin/hyperdex/coord/data/ -L /alidata1/admin/hyperdex/coord/log/


Daemons:

 hyperdex daemon -d -l 10.253.9.183 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/ 

hyperdex daemon -d -l 10.253.101.10 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/ 

 

The space creation is the same.

Data generated by a single process:

python generateData.py 10.139.98.119 1982 testSpace 1 100000 &

 

Report log:

total cost:

0:12:47.515368

total write cost:

0:12:43.911953

 

That is only about 130 records/s with a single Python process.
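Again checking the figure against the reported timing:

```python
# 100,000 puts in a total write cost of 0:12:43.911953
write_seconds = 12 * 60 + 43.911953
rate = 100000 / write_seconds
print(round(rate))  # ~131 records/s, more than 5x slower than the single-daemon run
```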

 

Data generated by 15 processes:

#!/bin/sh

for (( i=0 ; i<15; i++ ))
do
    echo $i
    min=$[$i*20000 + 1 + 200000]
    max=$[$min + 20000 - 1]
    python generateData.py 10.139.98.119 1982 testSpace $min $max &
done

 

But this never completes! All the Python processes are scheduled out, waiting on the cluster.

 

admin    16660     1  0 17:34 pts/0    00:00:00 python generateData.py 10.139.98.119 1982 testSpace 200001 220000

......

admin    16674     1  0 17:34 pts/0    00:00:00 python generateData.py 10.139.98.119 1982 testSpace 480001 500000

 

When I try to count the rows in my space like below:

>>> c.count('testSpace',{})

it never returns a number; it just waits.

 

The netstat output looks fine:


 

And CPU usage is low; disk I/O is also low.

A daemon server:



 

The bandwidth between these machines is about 500 Mbit/s.


 

From the Python side, it is always waiting for a put operation to complete.

From the HyperDex side, I don't see any blocking issue.

 

If I use async calls to put the data, the same thing happens.


 python generateData_async.py 10.139.98.119 1982 testSpace 1 100000 &

 

You can see that after I remove the space, I get the exception info below.


 


I checked the application logs printed by HyperDex, but they look fine:

The coordinator:


The daemon:

 

I also ran a similar experiment with 3 coordinators and 8 daemon servers. The result is the same. Can anyone give me some suggestions?


Thanks,

Hao

generateData.py
generateData_async.py

Hao Yuan

Apr 26, 2016, 4:20:01 AM4/26/16
to hyperdex-discuss
I also set up a cluster with 3 coordinators and 8 daemon servers.

For each coordinator, I run 2 processes, each doing 30,000 put operations. But some put operations still get stuck.


python generateData.py 10.253.9.201 1982 testSpace 1 30000 &
python generateData.py 10.253.9.201 1982 testSpace 30000 60000 &
python generateData.py 10.253.101.72 1982 testSpace 60000 90000 &
python generateData.py 10.253.101.72 1982 testSpace 90000 120000 &
python generateData.py 10.253.9.175 1982 testSpace 120000 150000 &
python generateData.py 10.253.9.175 1982 testSpace 150000 180000 &

Two processes are stuck:
[admin@iZ239smh8cgZ script]$ ps -ef|grep python
admin     8209  8004  0 16:03 pts/1    00:00:02 python generateData.py 10.253.101.72 1982 testSpace 90000 120000
admin     8212  8004  0 16:04 pts/1    00:00:01 python generateData.py 10.253.9.175 1982 testSpace 150000 180000
admin     8222  8004  0 16:11 pts/1    00:00:00 grep --color=auto python

And the count of the data stays unchanged:
>>> c.count('testSpace',{})
141494L
>>> c.count('testSpace',{})
141494L
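Assuming the six processes together attempt roughly keys 1..180000 (the exact total depends on whether generateData.py treats its range arguments as inclusive, and adjacent ranges here share a boundary key such as 30000), the stalled count leaves tens of thousands of puts unaccounted for:

```python
expected = 180000       # approximate number of keys across the six processes
stuck_count = 141494    # value returned repeatedly by c.count('testSpace', {})
print(expected - stuck_count)  # -> 38506 puts never completed
```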




Hao Yuan

Apr 28, 2016, 6:08:42 AM4/28/16
to hyperdex-discuss
The operating system is CentOS 7.0.1406.