Re: hbal: Warning: cluster has inconsistent data:


Iustin Pop

Sep 28, 2010, 2:18:02 PM
to gan...@googlegroups.com
On Tue, Sep 28, 2010 at 02:55:07PM -0300, Miguel Di Ciurcio Filho wrote:
> Hello folks,
>
> I'm experimenting with Ganeti and hbal, but right now I'm stuck with
> this warning:
>
> root@node01:~# hbal -m colossus
> Warning: cluster has inconsistent data:
>   - node quake.xxx is missing -2967 MB ram and 23 GB disk
>
> Loaded 3 nodes, 9 instances
> Initial check done: 0 bad nodes, 0 bad instances.
> Initial score: 0.31727619
> Trying to minimize the CV...
> No solution found
> Solution length=0
>
> root@node01:~# gnt-node list
> Node       DTotal  DFree MTotal MNode MFree Pinst Sinst
> node01.xxx 931.5G 869.8G   7.8G  564M  7.3G     3     3
> node03.xxx   2.7T   2.7T  15.7G  860M 14.9G     3     3
> quake.xxx  232.7G 148.2G   3.8G  737M  3.0G     3     3

Not-so-nice alignment here :)

> root@quake:~# free
>              total       used       free     shared    buffers     cached
> Mem:       3979312    1493616    2485696          0     250056     395960
> -/+ buffers/cache:     847600    3131712
> Swap:       499704          0     499704

What instances are running on quake, and what is their memory/disk size?
What do "lvs" and "vgs" say for it?

> gnt-cluster verify and verify-disks run fine.
>
> Setup:
> All machines are Debian Squeeze, using kvm and the stock ganeti2 2.1.6 packages.

For KVM, it's harder to get reliable memory usage stats than for Xen,
hence (I would suspect) the weird memory values. For disk, are you using
the volume group for other volumes too?
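
For example, on quake (vgs/lvs are the stock LVM tools; the column list
is just one I'd suggest, adjust as needed):

  vgs                               # compare VFree with gnt-node's DFree
  lvs -o lv_name,lv_size,vg_name    # look for volumes Ganeti doesn't own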

Note: you could collect the entire cluster data by running "hscan node01",
which will create a node01.data file. It's a handy way to look at the
cluster resource usage.
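
For example (from memory, so double-check the flags in the man pages):

  hscan -d /tmp node01        # should write /tmp/node01.data
  hbal -t /tmp/node01.data    # re-run the balancer offline from the snapshot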

regards,
iustin

Miguel Di Ciurcio Filho

Sep 28, 2010, 1:55:07 PM
to gan...@googlegroups.com
Hello folks,

I'm experimenting with Ganeti and hbal, but right now I'm stuck with
this warning:

root@node01:~# hbal -m colossus
Warning: cluster has inconsistent data:
  - node quake.xxx is missing -2967 MB ram and 23 GB disk

Loaded 3 nodes, 9 instances
Initial check done: 0 bad nodes, 0 bad instances.
Initial score: 0.31727619
Trying to minimize the CV...
No solution found
Solution length=0

root@node01:~# gnt-node list
Node       DTotal  DFree MTotal MNode MFree Pinst Sinst
node01.xxx 931.5G 869.8G   7.8G  564M  7.3G     3     3
node03.xxx   2.7T   2.7T  15.7G  860M 14.9G     3     3
quake.xxx  232.7G 148.2G   3.8G  737M  3.0G     3     3

root@quake:~# free
             total       used       free     shared    buffers     cached
Mem:       3979312    1493616    2485696          0     250056     395960
-/+ buffers/cache:     847600    3131712
Swap:       499704          0     499704

gnt-cluster verify and verify-disks run fine.

Setup:
All machines are Debian Squeeze, using kvm and the stock ganeti2 2.1.6 packages.

Regards,

Miguel

Miguel Di Ciurcio Filho

Sep 28, 2010, 2:49:33 PM
to gan...@googlegroups.com
On Tue, Sep 28, 2010 at 3:18 PM, Iustin Pop <iu...@k1024.org> wrote:
> On Tue, Sep 28, 2010 at 02:55:07PM -0300, Miguel Di Ciurcio Filho wrote:
>> Hello folks,
>>
>> I'm experimenting with Ganeti and hbal, but right now I'm stuck with
>> this warning:
>>
>> root@node01:~# hbal -m colossus
>> Warning: cluster has inconsistent data:
>>   - node quake.xxx is missing -2967 MB ram and 23 GB disk
>>

>> root@quake:~# free


>>              total       used       free     shared    buffers     cached
>> Mem:       3979312    1493616    2485696          0     250056     395960
>> -/+ buffers/cache:     847600    3131712
>> Swap:       499704          0     499704
>
> What instances are running on quake, and what is their memory/disk size?
> What do "lvs" and "vgs" say for it?

The LVM problem does not surprise me; I actually have some orphan
volumes. The memory is what is really strange.

root@node01:~# gnt-instance list|grep quake
araguaia.ic.unicamp.br kvm centos+default quake.ic.unicamp.br running 1.0G
gandalf.ic.unicamp.br  kvm centos+default quake.ic.unicamp.br running 1.0G
prcolor.ic.unicamp.br  kvm centos+default quake.ic.unicamp.br running 1.0G

Some black magic to sum all the memory used by the three kvm instances:
root@quake:~# ps axfu| grep kvm| awk 'BEGIN {total_vsz=0;total_rss=0}
  {total_vsz=total_vsz+$5;total_rss=total_rss+$6}
  END {printf("VSZ: %d, RSS: %d\n", total_vsz/1024, total_rss/1024)}'

VSZ: 3705, RSS: 902
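
(A cleaner variant, since "grep kvm" also catches the grep process
itself, assuming the KVM processes are simply named kvm as with Debian's
/usr/bin/kvm wrapper:

  ps -C kvm -o vsz=,rss= | awk '{v += $1; r += $2}
    END {printf("VSZ: %d MB, RSS: %d MB\n", v/1024, r/1024)}'
)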

It seems the inconsistency is growing. A few minutes ago it was -2967, now:

root@node01:~# hbal -m colossus
Warning: cluster has inconsistent data:
  - node quake.ic.unicamp.br is missing -3016 MB ram and 23 GB disk

Regards,

Miguel

Miguel Di Ciurcio Filho

Sep 28, 2010, 2:53:18 PM
to gan...@googlegroups.com
On Tue, Sep 28, 2010 at 3:18 PM, Iustin Pop <iu...@k1024.org> wrote:
>
> Note: you could collect the entire cluster data by running "hscan node01",
> which will create a node01.data file. It's a handy way to look at the
> cluster resource usage.
>

I forgot to include the result.

root@node01:~# hscan colossus
Name     Nodes Inst BNode BInst t_mem f_mem t_disk f_disk      Score
colossus     3    9     0     0 27967 23914   3959   3750 0.33827549

Iustin Pop

Sep 28, 2010, 3:05:22 PM
to gan...@googlegroups.com

Thanks, this makes sense.

The problem is that for KVM, there is a discrepancy between VSZ and RSS
if the instance is not yet using its entire memory. We display the VSZ,
but the correct value to use would be the instance's RSS.
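
For illustration, here is how you could read one instance's resident size
by hand; I'm assuming the KVM pidfiles live under
/var/run/ganeti/kvm-hypervisor/pid/, which may differ on your install:

  pid=$(cat /var/run/ganeti/kvm-hypervisor/pid/araguaia.ic.unicamp.br)
  awk '/^VmRSS/ {printf("%d MB resident\n", $2/1024)}' /proc/$pid/status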

I guess we need to improve the memory stats for KVM in Ganeti. In the
meantime, can you confirm that allocating full memory in the instances
gets rid of the warnings?
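
One quick way to do that, assuming memtester is available inside the
guests:

  memtester 900M 1    # touches and locks ~900 MB for one pass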

iustin

Miguel Di Ciurcio Filho

Sep 28, 2010, 4:06:54 PM
to gan...@googlegroups.com
>> Some black magic to sum all the memory used by the three kvm instances:
>> root@quake:~# ps axfu| grep kvm| awk 'BEGIN {total_vsz=0;total_rss=0}
>>   {total_vsz=total_vsz+$5;total_rss=total_rss+$6}
>>   END {printf("VSZ: %d, RSS: %d\n", total_vsz/1024, total_rss/1024)}'
>>
>> VSZ: 3705, RSS: 902
>>
>> It seems the inconsistency is growing. A few minutes ago it was -2967, now:
>> root@node01:~# hbal -m colossus
>> Warning: cluster has inconsistent data:
>
> Thanks, this makes sense.
>
> The problem is that for KVM, there is a discrepancy between VSZ and RSS
> if the instance is not yet using its entire memory. We display the VSZ,
> but the correct value to use would be the instance's RSS.
>
> I guess we need to improve the memory stats for KVM in Ganeti. In the
> meantime, can you confirm that allocating full memory in the instances
> gets rid of the warnings?
>

No apparent change after making the instances allocate all their RAM.

root@node01:~# hbal -m colossus
Warning: cluster has inconsistent data:
  - node quake.ic.unicamp.br is missing -2303 MB ram and 23 GB disk

root@quake:~# ps axfu| grep kvm| awk 'BEGIN {total_vsz=0;total_rss=0}
  {total_vsz=total_vsz+$5;total_rss=total_rss+$6}
  END {printf("VSZ: %d, RSS: %d\n\n", total_vsz/1024, total_rss/1024)}'

VSZ: 3889, RSS: 3050

root@node01:~# gnt-node list|grep quake
quake.ic.unicamp.br 232.7G 148.2G 3.8G 2.8G 258M 3 3


gnt-node list is printing correct values, and it seems to be showing
258 MB of actually free RAM, ignoring buffers/cache, which matches what
I'm seeing in top:

total_free = total - (KVM + host_stuff) - buffers_and_cache
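
Plugging in the numbers as a rough sanity check (the ~580 MB for
host_stuff plus buffers/cache is just inferred from the difference):

  echo $(( 3886 - 3050 - 578 ))   # 258 MB, matching gnt-node's MFree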

But hbal is still kinda lost. I have no clue where it is getting
-2303 MB of missing RAM from.

Regards,

Miguel

Iustin Pop

Sep 28, 2010, 4:29:00 PM
to gan...@googlegroups.com

hbal gets its information from Ganeti too. Can you try to send me the
*.data file created by hscan?

iustin

Iustin Pop

Sep 28, 2010, 4:54:13 PM
to gan...@googlegroups.com

Thanks for the data file, now I understand what is happening. Ganeti is
getting it wrong too.

Basically, we see that the node memory is 2.8G, but that already
includes the instance memory. Then hbal subtracts the instance memory
again, because under Xen it is not part of the node's reported memory.
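
Back-of-envelope, with rounded numbers from your outputs, and assuming
the consistency check is roughly t_mem - n_mem - i_mem - f_mem:

  # 3886 MB total, 2867 MB node memory (wrongly including the instances),
  # 3 x 1024 MB instance memory, 258 MB free:
  echo $(( 3886 - 2867 - 3072 - 258 ))   # -2311, close to the -2303 reported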

Hrmm, this needs fixing indeed, separately from the accurate reporting
of per-instance memory.

So thanks, these are two bugs related to KVM memory statistics! I know
this doesn't help you right now, but at least we know the issues.

iustin

Miguel Di Ciurcio Filho

Sep 29, 2010, 9:48:38 AM
to Iustin Pop, gan...@googlegroups.com
On Tue, Sep 28, 2010 at 5:54 PM, Iustin Pop <iu...@k1024.org> wrote:
>
> Basically, we see that the node memory is 2.8G, but that already
> includes the instance memory. Then hbal subtracts the instance memory
> again, because under Xen it is not part of the node's reported memory.
>
> Hrmm, this needs fixing indeed, separately from the accurate reporting
> of per-instance memory.
>
> So thanks, these are two bugs related to KVM memory statistics! I know
> this doesn't help you right now, but at least we know the issues.
>

No problem, I'm still planning the cluster, so there is time for a patch :-D

Do you think that opening an issue ticket would help?

Regards,

Miguel

Iustin Pop

Sep 29, 2010, 10:32:01 AM
to Miguel Di Ciurcio Filho, gan...@googlegroups.com

It certainly won't hurt, so please go ahead.

iustin
