HyperDexClientException: reconfiguration affecting virtual_server when putting data


Tadeusz Kopeć

22 Mar 2014, 06:46:32
to hyperdex...@googlegroups.com
Hello

I hired 3 machines from Amazon EC2 (m1.small) and tried to do some tests with HyperDex. I installed Ubuntu 13.10 (Saucy) and installed HyperDex as described in the tutorial.
On machine 1 I started the coordinator:
hyperdex coordinator -d -l <machine1_IP> -p 1982 -D /mnt/hyperdex/coordinator/
and a daemon:
hyperdex daemon -d --listen=<machine1_IP> --listen-port=2012 --coordinator=<machine1_IP> --coordinator-port=1982 --data=/mnt/hyperdex/daemon

On machines 2 and 3 I started daemons:
hyperdex daemon -d --listen=<local_machine_IP> --listen-port=2012 --coordinator=<machine1_IP> --coordinator-port=1982 --data=/mnt/hyperdex/daemon

Then I ran a Python script which operates on the DB, but it fails during an attempt to put data (I try to put 10,000 records; about 9,000 succeed). The message is:

File "keyValueStorage_small.py", line 34, in possiblyWaitAfterAsyncCommandIssued
    client.loop().wait()
  File "client.pyx", line 959, in hyperdex.client.Deferred.wait (bindings/python/hyperdex/client.c:9674)
  File "client.pyx", line 806, in hyperdex.client.hyperdex_python_client_deferred_encode_status (bindings/python/hyperdex/client.c:7244)
hyperdex.client.HyperDexClientException: HyperDexClientException: reconfiguration affecting virtual_server(3408)/server(4756571285073274212) [HYPERDEX_CLIENT_RECONFIGURE]

When I look into the coordinator logs, the problems seem to start with:
I0322 10:04:39.972847  2036 object_manager.cc:1088] hyperdex:server_suspect @ 19595: changing server(4756571285073274212) from AVAILABLE to NOT_AVAILABLE because we suspect it failed
...
I0322 10:04:39.973033  2036 object_manager.cc:1088] hyperdex:server_suspect @ 19595: issuing new configuration version 2566

Then it changes back to AVAILABLE, another server changes to NOT_AVAILABLE, and this goes on for about 4 minutes.

The logs of the daemon that failed first are:
I0322 10:01:10.981914  2518 state_transfer_manager.cc:149] ending outgoing transfer(3449)
I0322 10:01:10.983396  2518 daemon.cc:460] reconfiguration complete; resuming normal operation
I0322 10:04:39.974179  2518 daemon.cc:442] moving to configuration version=2566; pausing all activity while we reconfigure
I0322 10:04:40.183389  2518 daemon.cc:460] reconfiguration complete; resuming normal operation

Is this a known issue? Unfortunately the hyperdex.org site is down today.

Thanks for any help
Tadeusz

Ramesh

22 Mar 2014, 10:57:27
to hyperdex...@googlegroups.com, Tadeusz Kopeć
Hello Tadeusz
It seems like you are saving the coordinator and daemon data in the same location. Try saving the data in different locations and restart.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Tadeusz Kopeć

23 Mar 2014, 09:08:32
to hyperdex...@googlegroups.com, Tadeusz Kopeć
Hello Ramesh

Thanks for your response. In the coordinator command line I have "-D /mnt/hyperdex/coordinator/"
and in the daemon command line I have "--data=/mnt/hyperdex/daemon".
Aren't those directories separate enough? Or are there other directories I should override on the command line?

Robert Escriva

23 Mar 2014, 10:41:45
to hyperdex...@googlegroups.com, Tadeusz Kopeć
I think what's happening here in the original issue is that the server
is becoming unavailable because it stops responding to requests. The
coordinator correctly removes it from the configuration, leading to the
reconfigure message you receive. Upon receipt of the RECONFIGURE
message, you should retry the request, because the server handling it on
your behalf failed/was removed from the config.
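
For illustration, a minimal retry sketch with the Python bindings from your traceback (the exception type matches; treating any exception whose text contains RECONFIGURE as retryable is an assumption for the sketch, not the official contract):

from hyperdex.client import HyperDexClientException

def put_with_retry(client, space, key, attrs, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.put(space, key, attrs)
        except HyperDexClientException as e:
            # The server that was handling the request on your behalf is gone
            # from the config; a retry gets routed with the new configuration.
            if 'RECONFIGURE' not in str(e) or attempt == max_retries - 1:
                raise

The same idea applies to the async calls in your script: when loop().wait() raises with this status, re-issue the corresponding async_put.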

If this happens often, it may be a sign that you need to upgrade your
VMs on AWS to have more I/O bandwidth, or follow the tuning instructions
here:
http://hyperdex.org/doc/latest/TuningHyperDex/#sec:tuning:filesystem

If the problem persists, we'll work with you to sort out the cause and
publish a solution.

-Robert


Tadeusz Kopeć

24 Mar 2014, 08:22:59
to hyperdex...@googlegroups.com, Tadeusz Kopeć
Hi Robert

I tried tuning, but it didn't help. I put the data on the instance store, which AFAIK is a physical HDD. Surprisingly, my tests worked on a micro instance with three daemons running on a single machine, so maybe it's not a problem of I/O bandwidth but of network performance.
I could not find what system requirements HyperDex has. Maybe I am somehow abusing it? What I'd like to do is put 1,000,000 records, each with a 256-character key and a 2K-character value, and then test how gets, searches, and updates perform.
I first tried putting 10,000 records using this code:
import random  # needed for the randomised keys below

MAX_OUTSTANDING = 1024

# generateKey and generateValue are my helpers (not shown) that build keys and
# values of the requested sizes.

def possiblyWaitAfterAsyncCommandIssued(client, outstanding):
        # Once the cap is reached, wait for one outstanding operation to
        # complete before issuing the next one.
        if outstanding > MAX_OUTSTANDING:
                client.loop().wait()
        else:
                outstanding += 1
        return outstanding

def putValuesToDB_async(client, space, numValues):
        outstanding = 0
        for i in range(numValues):
                client.async_put(space, generateKey(i*10 + random.randint(0, 10)), {'value' : generateValue(2048)})
                outstanding = possiblyWaitAfterAsyncCommandIssued(client, outstanding)
        # Drain the remaining outstanding operations.
        for _ in range(outstanding):
                client.loop().wait()
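
For the later read/search phase I mentioned, this is roughly what I have in mind (a rough sketch; generateKey is the same helper as above, and I'm using get() and search() as I understand them from the client docs):

def getValuesFromDB(client, space, numValues):
        found = 0
        for i in range(numValues):
                # get() returns a dict of attributes, or None if the key is absent
                if client.get(space, generateKey(i * 10)) is not None:
                        found += 1
        return found

def searchValuesInDB(client, space, someValue):
        # search() returns an iterator over objects matching the predicate
        return sum(1 for _ in client.search(space, {'value' : someValue}))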

I then changed MAX_OUTSTANDING to 10, but it didn't help.
Is my approach wrong? Is HyperDex not suitable for such use?
In the case of an overloaded system I would rather expect increased latency, not dropping the overloaded agent and reconfiguring (it causes a domino effect: other agents get more load, so they are dropped, the dropped agents are restored, and the reconfiguration keeps going). Is it somehow configurable how eager the coordinator is to drop agents which respond too slowly?

Thanks for your support
Tadeusz

Tadeusz Kopeć

6 May 2014, 10:51:37
to hyperdex...@googlegroups.com, Tadeusz Kopeć
Hi Robert

Is there any progress? Is there something I am doing wrong?

Robert Escriva

10 May 2014, 08:49:18
to hyperdex...@googlegroups.com, Tadeusz Kopeć
We have done some work to improve stability during reconfiguration.
It's present in the current Git code, and will be in the next release.

-Robert

Holger Winkelmann

29 Jan 2015, 05:22:23
to hyperdex...@googlegroups.com, tadeusz...@gmail.com
Hi
As this was posted in May last year, are the changes in the releases after that?

We see similar problems with v1.6, as reported here:



Robert Escriva

2 Feb 2015, 12:46:42
to hyperdex...@googlegroups.com, tadeusz...@gmail.com
The specific changes I mentioned in that email were incorporated into
the subsequent releases.

We do make changes to improve reconfiguration performance, and it sounds
like the similar issues you've reported warrant some more attention to
this topic.

-Robert

Dan

20 Jul 2015, 14:07:51
to hyperdex...@googlegroups.com, tadeusz...@gmail.com
I've been encountering this recently in some EC2 setups. My configuration has 5 daemons, 2 of which are stuck in this reconfiguration/moving cycle (Ubuntu 14.04, HyperDex 1.8.1, m3.xl). When I do load testing on the cluster, 3 of the servers are seeing near 100% CPU utilization, and these 2 are doing no work.

Did anyone ever determine what needed to be tuned to help fix this? 

Thanks,

-Dan

Matthew Tamayo

20 Jul 2015, 15:01:01
to hyperdex...@googlegroups.com
I think this issue has been resolved in 1.8.

-mtr

Dan

4 Aug 2015, 18:01:24
to hyperdex-discuss
Hi Matthew,

I am running on 1.8.1 and encountering this.

-Dan

Matthew Tamayo

4 Aug 2015, 18:20:33
to hyperdex...@googlegroups.com
Weird; we'd have to get our benchmark set up again, but we haven't seen this. This tends to happen when a coordinator gets partitioned, or when something goes wrong with the coordinators and they start replaying logs to get back on the same page.

-mtr

Emin Gün Sirer

8 Oct 2015, 14:36:22
to hyperdex...@googlegroups.com
Thanks Matthew, moving this to the github bug tracker: https://github.com/rescrv/HyperDex/issues/225

Jordan Menzin

13 Oct 2015, 21:17:27
to hyperdex-discuss
Hi,

I'm also reliably able to reproduce this. It happens when I'm slamming HyperDex with writes. I would be happy to provide more information if I can be helpful.

-Jordan

Emin Gün Sirer

13 Oct 2015, 21:19:45
to hyperdex...@googlegroups.com
Hi Jordan,

Yes, we'd appreciate info on the platform (OS + HyperDex version) you are using, whether you compiled from source or are using packaged binaries, and sufficient details to reproduce the error.

Many thanks!
- egs


Jordan Menzin

13 Oct 2015, 21:52:32
to hyperdex-discuss
Hi,

CentOS 6.5 and HyperDex 1.8.1. I'm using the packaged binaries. I have 16 client processes writing roughly 10 KB-100 KB blobs with async puts in batches; roughly 128 concurrent writers in total.

Currently running with just 1 coordinator for testing purposes.  

I am using the Java client. One thing I did a little strangely: I installed the Java client RPM, which depends on OpenJDK, but I run with Oracle JDK build 1.7.0_79-b15, to avoid spending the time compiling from source.

When I run a smaller "job" I don't see these errors; it only happens when writing fairly heavily. Load averages and iowait are nothing crazy, though, so I'm surprised things are overloaded.

Error message:

org.hyperdex.client.HyperDexClientException: reconfiguration affecting virtual_server(133)/server(4385455619558848505)

org.hyperdex.client.HyperDexClientException: reconfiguration affecting virtual_server(133)/server(4385455619558848505)

    at org.hyperdex.client.Deferred.waitForIt(Native Method)  

Hope that helps a little bit.

Best,

Jordan

Emin Gün Sirer

13 Oct 2015, 21:54:26
to hyperdex...@googlegroups.com
This is very useful, thank you. We currently suspect that this is an issue with the Java bindings. We'll take a closer look shortly.