All machines are installed with the latest code from GitHub.
The file system is ext4, and I have tuned every machine as the guidance recommends.
Cluster setup:
Coordinator:
hyperdex coordinator -d -l 10.139.98.119 -p 1982 -D /alidata1/admin/hyperdex/coord/data/ -L /alidata1/admin/hyperdex/coord/log/
Daemons:
hyperdex daemon -d -l 10.253.9.183 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/
Space creation script:
hyperdex add-space -h 10.139.98.119 -p 1982 << EOF
space testSpace
key int id
attributes a1, a2, a3, a4, a5, a6, a7, a8
subspace a1, a2, a3
create 24 partitions
tolerate 1 failures
EOF
Data generated by a single process:
python generateData.py 10.139.98.119 1982 testSpace 1 100000 &
Report log:
total cost:
0:02:17.793744
total write cost:
0:02:14.848735
That is about 700 records/s using a single Python process.
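generateData.py is not shown in this post, so the following is only a sketch of how the two figures in the report log ("total cost" and "total write cost") might be measured. The `run_load` helper, the `put` callable, and the attribute values are all hypothetical; in a real script `put` would wrap the HyperDex client call.

```python
import time
from datetime import timedelta


def run_load(put, first_id, last_id):
    """Write ids first_id..last_id (inclusive) through `put` and time it.

    `put` is a hypothetical callable taking (key, attributes); it stands
    in for whatever generateData.py really does to store one record.
    """
    start = time.monotonic()
    write_seconds = 0.0
    count = 0
    for key in range(first_id, last_id + 1):
        # a1..a8 match the space definition; the values are made up.
        attrs = {"a%d" % n: "v%d" % key for n in range(1, 9)}
        t0 = time.monotonic()
        put(key, attrs)
        write_seconds += time.monotonic() - t0
        count += 1
    total_seconds = time.monotonic() - start
    rate = count / total_seconds if total_seconds > 0 else float("inf")
    return {
        "count": count,
        "total": timedelta(seconds=total_seconds),
        "write": timedelta(seconds=write_seconds),
        "rate": rate,
    }


if __name__ == "__main__":
    store = {}  # in-memory stand-in for the cluster
    report = run_load(store.__setitem__, 1, 1000)
    print("total cost:", report["total"])
    print("total write cost:", report["write"])
    print("records/s: %.0f" % report["rate"])
```

Timing the `put` call separately from the loop makes it easy to see whether the time is spent waiting on the cluster or in the client script itself.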
Data generated by 15 processes:
#!/bin/bash
# Launch 15 writers in parallel, 20,000 keys each, starting at key 200001.
for (( i=0; i<15; i++ ))
do
  echo $i
  min=$(( i*20000 + 1 + 200000 ))
  max=$(( min + 20000 - 1 ))
  python generateData.py 10.139.98.119 1982 testSpace $min $max &
done
The 300,000 rows of data are written to the cluster in about 30 seconds.
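The key ranges produced by the loop above can be checked with a short sketch. `worker_ranges` is a hypothetical helper, not part of generateData.py; it only reproduces the same min/max arithmetic for each worker.

```python
def worker_ranges(workers=15, per_worker=20000, offset=200000):
    """Return the inclusive (min, max) key range for each worker."""
    ranges = []
    for i in range(workers):
        lo = i * per_worker + 1 + offset
        hi = lo + per_worker - 1
        ranges.append((lo, hi))
    return ranges


if __name__ == "__main__":
    ranges = worker_ranges()
    print(ranges[0], ranges[-1])
    print(sum(hi - lo + 1 for lo, hi in ranges), "keys in total")
```

With the defaults, the 15 ranges are contiguous and do not overlap, running from 200001 to 500000, which is where the 300,000 rows come from.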
But the behavior changes when I add a new node to the cluster, or start a new cluster as below:
Cluster setup:
Coordinator:
hyperdex coordinator -d -l 10.139.98.119 -p 1982 -D /alidata1/admin/hyperdex/coord/data/ -L /alidata1/admin/hyperdex/coord/log/
Daemons:
hyperdex daemon -d -l 10.253.9.183 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/
hyperdex daemon -d -l 10.253.101.10 -p 2012 -c 10.139.98.119 -P 1982 -D /alidata1/admin/hyperdex/daemon/data/ -L /alidata1/admin/hyperdex/daemon/log/
The space creation script is the same.
Data generated by a single process:
python generateData.py 10.139.98.119 1982 testSpace 1 100000 &
Report log:
total cost:
0:12:47.515368
total write cost:
0:12:43.911953
That is only about 130 records/s using a single Python process.
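For reference, the two rates can be derived from the "total cost" values in the report logs. This is just the arithmetic, using the two timings quoted above; `parse_elapsed` and `records_per_second` are hypothetical helper names.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Parse an 'H:MM:SS.ffffff' string as printed in the report log."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


def records_per_second(records, elapsed):
    return records / elapsed.total_seconds()


# 100,000 records in each single-process run.
one_daemon = records_per_second(100000, parse_elapsed("0:02:17.793744"))
two_daemons = records_per_second(100000, parse_elapsed("0:12:47.515368"))
slowdown = one_daemon / two_daemons

if __name__ == "__main__":
    print("one daemon:  %.0f records/s" % one_daemon)
    print("two daemons: %.0f records/s" % two_daemons)
    print("slowdown:    %.1fx" % slowdown)
```

So the second daemon makes the single-process run roughly five to six times slower, instead of faster.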
Data generated by 15 processes:
#!/bin/bash
# Launch 15 writers in parallel, 20,000 keys each, starting at key 200001.
for (( i=0; i<15; i++ ))
do
  echo $i
  min=$(( i*20000 + 1 + 200000 ))
  max=$(( min + 20000 - 1 ))
  python generateData.py 10.139.98.119 1982 testSpace $min $max &
done
But this never finishes! All the Python processes are switched off the CPU and are waiting for the cluster to process their requests:
admin 16660 1 0 17:34 pts/0 00:00:00 python generateData.py 10.139.98.119 1982 testSpace 200001 220000
......
admin 16674 1 0 17:34 pts/0 00:00:00 python generateData.py 10.139.98.119 1982 testSpace 480001 500000
When I try to count the objects in my space as below:
>>> c.count('testSpace',{})
It never returns a number; it just waits.
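One way to keep a diagnostic session from blocking forever is to run the call in a background thread and give up after a deadline. This is a generic sketch, not a HyperDex feature: `call_with_timeout` is a hypothetical helper, and the demo uses stand-in functions rather than a real client. With a client, `fn` could be something like `lambda: c.count('testSpace', {})`.

```python
import threading
import time


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a daemon thread; return (status, value).

    status is "ok", "error", or "timeout". On timeout the worker thread
    is left running in the background; it is a daemon thread, so it does
    not keep the interpreter alive at exit.
    """
    result = {}

    def runner():
        try:
            result["value"] = fn(*args, **kwargs)
        except Exception as exc:  # report the failure instead of losing it
            result["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return ("timeout", None)
    if "error" in result:
        return ("error", result["error"])
    return ("ok", result["value"])


if __name__ == "__main__":
    # Stand-ins for a fast call and a call that hangs.
    print(call_with_timeout(lambda: 42, 1.0))
    print(call_with_timeout(lambda: time.sleep(0.5), 0.05))
```

This does not cancel the stuck request; it only lets the caller notice the hang and move on to collecting other diagnostics.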
The netstat output looks fine:
CPU usage is low and disk I/O is also low:
A daemon server:
The bandwidth between these machines is about 500 Mbit/s.
From the Python side, the client is always waiting for a put operation to complete.
From the HyperDex side, I don’t see anything blocking.
If I use async calls to put the data, the same thing happens.
python generateData_async.py 10.139.98.119 1982 testSpace 1 100000 &
After I remove the space, I get the exception info below.
I checked the application logs printed by HyperDex, but they look fine:
The coordinator:
The daemon:
I also ran a similar experiment with 3 coordinators and 8 daemon servers. The result is the same. Can anyone give me some suggestions?
Thanks,
Hao