Writing is very slow and multi-process writing gets blocked when the number of daemon servers grows


Hao Yuan

Apr 25, 2016, 6:18:32 AM4/25/16
to hyperdex-discuss
Hi Robert, hi all,

I have run into a serious problem doing put operations against a multi-node HyperDex cluster.

The writes are very slow, and multi-process writes get blocked. Once that happens, count, search, and other operations block as well. This is very bad for me. I have tried several times and it always happens.

The detailed steps:

All machines run the latest code from GitHub.

 

The file system is ext4, and I have applied the recommended tuning on every machine.

 

Cluster setup:

 

Coordinator:

 hyperdex coordinator -d -l 10.139.98.119 -p 1982 -D /alidata1/admin/hyperdex/coord/data/ -L /alidata1/admin/hyperdex/coord/log/


Daemons:

 hyperdex daemon -d -l 10.253.9.183 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/ 

 

Space creation script:

hyperdex add-space -h 10.139.98.119 -p 1982 << EOF
        space testSpace
        key int id
        attributes a1, a2, a3, a4, a5, a6, a7, a8
        subspace a1, a2, a3
        create 24 partitions
        tolerate 1 failures
EOF

 

Data generated by a single process:

python generateData.py 10.139.98.119 1982 testSpace 1 100000 &
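generateData.py is attached rather than inlined; for context, a minimal sketch of what such a loader might look like, assuming the HyperDex Python bindings (`hyperdex.client.Client`) and placeholder attribute values (the real payload in the attached script may differ):

```python
def load(client, space, lo, hi):
    """Synchronously put keys lo..hi into `space`; returns the number of puts issued."""
    count = 0
    for i in range(lo, hi + 1):
        # Eight string attributes a1..a8 with placeholder values (assumption).
        value = {'a%d' % k: 'val-%d' % i for k in range(1, 9)}
        client.put(space, i, value)
        count += 1
    return count

# Real usage (requires the HyperDex Python bindings):
#   from hyperdex.client import Client
#   load(Client('10.139.98.119', 1982), 'testSpace', 1, 100000)
```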

 

Report log:

total cost:

0:02:17.793744

total write cost:

0:02:14.848735

 

That is about 700 records/s with a single Python process.
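As a quick sanity check, the rate follows directly from the reported write time:

```python
# 100,000 puts in a total write cost of 0:02:14.848735
write_seconds = 2 * 60 + 14.848735
rate = 100000 / write_seconds
print(round(rate))  # ~742 records/s, in the "about 700/s" ballpark reported
```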

Data generated by 15 processes:

#!/bin/sh

for (( i=0 ; i<15; i++ ))
do
    echo $i
    min=$[$i*20000 + 1 + 200000]
    max=$[$min + 20000 - 1]
    python generateData.py 10.139.98.119 1982 testSpace $min $max &
done
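The loop hands each of the 15 processes a disjoint 20,000-key range; reproducing the arithmetic in Python confirms the ranges cover keys 200001..500000, i.e. 300,000 keys total:

```python
ranges = []
for i in range(15):
    lo = i * 20000 + 1 + 200000
    hi = lo + 20000 - 1
    ranges.append((lo, hi))

print(ranges[0])   # (200001, 220000)
print(ranges[-1])  # (480001, 500000)
total = sum(hi - lo + 1 for lo, hi in ranges)
print(total)       # 300000
```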

 

The 300,000 rows of data can be put into the cluster in about 30 seconds.

 

But when I add a new node to the cluster, or start a new cluster as below:

Cluster setup:

 

Coordinator:

 hyperdex coordinator -d -l 10.139.98.119 -p 1982 -D /alidata1/admin/hyperdex/coord/data/ -L /alidata1/admin/hyperdex/coord/log/


Daemons:

 hyperdex daemon -d -l 10.253.9.183 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/ 

hyperdex daemon -d -l 10.253.101.10 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/ 

 

The space creation is the same.

Data generated by a single process:

python generateData.py 10.139.98.119 1982 testSpace 1 100000 &

 

Report log:

total cost:

0:12:47.515368

total write cost:

0:12:43.911953

 

That is only about 130 records/s with a single Python process.
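Again checking the figure against the reported timing:

```python
# 100,000 puts in a total write cost of 0:12:43.911953
write_seconds = 12 * 60 + 43.911953
rate = 100000 / write_seconds
print(round(rate))  # ~131 records/s, more than 5x slower than the single-daemon run
```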

 

Data generated by 15 processes:

#!/bin/sh

for (( i=0 ; i<15; i++ ))
do
    echo $i
    min=$[$i*20000 + 1 + 200000]
    max=$[$min + 20000 - 1]
    python generateData.py 10.139.98.119 1982 testSpace $min $max &
done

 

But this never completes! All the Python processes are scheduled out, waiting on the cluster.

 

admin    16660     1  0 17:34 pts/0    00:00:00 python generateData.py 10.139.98.119 1982 testSpace 200001 220000

......

admin    16674     1  0 17:34 pts/0    00:00:00 python generateData.py 10.139.98.119 1982 testSpace 480001 500000

 

When I try to count the rows in my space like below:

>>> c.count('testSpace',{})

it never returns a number; it just waits.

 

The netstat output looks fine:


 

And CPU usage is low; disk I/O is also low.

A daemon server:



 

The bandwidth between these machines is about 500 Mbit/s.


 

From the Python side, it is always waiting for a put operation to complete.

From the HyperDex side, I don't see any blocking issue.

 

If I use async calls to put the data, the same thing happens.


 python generateData_async.py 10.139.98.119 1982 testSpace 1 100000 &

 

You can see that after I remove the space, I get the exception info below.


 


I checked the application logs printed by HyperDex, but they look fine:

The coordinator:


The daemon:

 

I also ran a similar experiment with 3 coordinators and 8 daemon servers. The result is the same. Can anyone give me some suggestions?


Thanks,

Hao

generateData.py
generateData_async.py

Hao Yuan

Apr 26, 2016, 4:20:01 AM4/26/16
to hyperdex-discuss
I also set up a cluster with 3 coordinators and 8 daemon servers.

For each coordinator, I run 2 processes, each doing 30,000 put operations. But some put operations still get stuck.


python generateData.py 10.253.9.201 1982 testSpace 1 30000 &
python generateData.py 10.253.9.201 1982 testSpace 30000 60000 &
python generateData.py 10.253.101.72 1982 testSpace 60000 90000 &
python generateData.py 10.253.101.72 1982 testSpace 90000 120000 &
python generateData.py 10.253.9.175 1982 testSpace 120000 150000 &
python generateData.py 10.253.9.175 1982 testSpace 150000 180000 &

Two processes are stuck:
[admin@iZ239smh8cgZ script]$ ps -ef|grep python
admin     8209  8004  0 16:03 pts/1    00:00:02 python generateData.py 10.253.101.72 1982 testSpace 90000 120000
admin     8212  8004  0 16:04 pts/1    00:00:01 python generateData.py 10.253.9.175 1982 testSpace 150000 180000
admin     8222  8004  0 16:11 pts/1    00:00:00 grep --color=auto python

And the count of the data stays unchanged:
>>> c.count('testSpace',{})
141494L
>>> c.count('testSpace',{})
141494L
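Assuming the six processes together attempt roughly keys 1..180000 (the exact total depends on whether generateData.py treats its range arguments as inclusive, and adjacent ranges here share a boundary key such as 30000), the stalled count leaves tens of thousands of puts unaccounted for:

```python
expected = 180000       # approximate number of keys across the six processes
stuck_count = 141494    # value returned repeatedly by c.count('testSpace', {})
print(expected - stuck_count)  # -> 38506 puts never completed
```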




Hao Yuan

Apr 28, 2016, 6:08:42 AM4/28/16
to hyperdex-discuss
The operating system is CentOS 7.0.1406.