Defining a zone structure with 20 services on 4 servers, with replication

Miguel Ausó

Aug 4, 2015, 4:48:45 PM
to project-voldemort
Hi all!

We want to set up a zone structure.

But I just do not understand how replication between zones works.

The final situation is:

We have 4 servers (256 GB RAM each). We use LXC, so we have roughly 5 LXC containers running Voldemort on each physical server.

We will have a total of 20 instances of Voldemort.

I thought about creating 5 zones, so that Zone 0 spans the 4 servers, as does Zone 1, and so on for every zone.

Since we use LXC, if a server breaks we will lose five Voldemort instances, so we need to replicate our data to another server.

I have two questions:

What is the best way to set the proximity-list?

And

How should the replication-factor be set in the stores configuration?

Probably something like this:

<replication-factor zone-id="0">2</replication-factor>
<replication-factor zone-id="1">2</replication-factor>
<replication-factor zone-id="2">2</replication-factor>
<replication-factor zone-id="3">2</replication-factor>
<replication-factor zone-id="4">2</replication-factor>

I guess that in this case, if data goes to partition x in zone 0, then it will be replicated on another node in the same zone.

Is that correct?

Félix GV

Aug 4, 2015, 5:23:03 PM
to project-voldemort
Hi Miguel,

Thanks for your interest in Voldemort.

The concept of Voldemort zones is intended to be used for geographically distant data centers. All Voldemort servers which are in the same DC and which you intend to use in a given logical cluster should be within the same zone.

Voldemort clients will do their operations against the nodes in the local zone and then asynchronously replicate their writes to the distant ones. Zones are meant to indicate which machines are local versus which are a long hop of latency away. They are not meant to emulate "rack awareness" or "colocated containers" as you're describing. Thus, having a cluster with many zones in the same DC is not the intended use.
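
To illustrate the intended use: a minimal cluster.xml header for two geographically separate data centers might look something like the sketch below (the cluster name is just a placeholder). Each DC gets one zone, and each zone's proximity-list points at the other:

<cluster>
  <name>two-dc-cluster</name>
  <zone>
    <zone-id>0</zone-id>
    <proximity-list>1</proximity-list>
  </zone>
  <zone>
    <zone-id>1</zone-id>
    <proximity-list>0</proximity-list>
  </zone>
  <!-- one <server> entry per node follows, each carrying the <zone-id> of its DC -->
</cluster>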

Furthermore, your setup with LXC containers is dangerous... As you pointed out yourself, the loss of one physical machine will result in many logical nodes going down. This will make it unnecessarily complicated and wasteful to provision your replication factor reliably.

Voldemort's design is intended to leverage its partitioning scheme in order to allow flexible and elastic cluster management. Moving containers around is not the intended way to scale or rebalance Voldemort.

In any case, if you do decide to go down that path, please let us know what you find out. Perhaps write a post to share your experience? I'd be curious to hear, even if it doesn't seem to me like a good use of Voldemort (or LXC...).

-F


Miguel Ausó

Aug 4, 2015, 6:23:18 PM
to project-voldemort
Hi Félix 

Thanks for the quick reply.

Now I can see that my solution is probably not the best one, but I have 4 servers with 32 CPUs, 256 GB of RAM, and several terabytes on SAS disks each; some stores need replication and others don't.

I need to use these servers for the Voldemort cluster, so another solution I thought of is to run only 4 Voldemort instances, one per server. I understand that Voldemort uses JNA, so we could give 32 GB of RAM to the JVM and leave the rest, about 220 GB, for cache via JNA.

But I couldn't find any documentation on whether that is good practice.

Could you help me find a better solution?

Finally, don't worry, I will write about the solution we implement; I've already got the first post started.


Thanks!

Arunachalam

Aug 4, 2015, 7:47:48 PM
to project-...@googlegroups.com
Voldemort uses JNA for mlock and a few other things; unfortunately, caching is not one of them. Voldemort's main storage engine is BDB, which caches inside the JVM heap. With the HotSpot JDK, we have never tried anything larger than a 32 GB heap. I have also heard stories of the JDK stalling for multiple seconds with heaps larger than 64 GB.

Voldemort is mostly designed for commodity hardware. 256 GB of RAM is one of the cases it is not designed for; you might hack around it, but in my opinion you should explore other options.

Thanks,
Arun.


Félix GV

Aug 4, 2015, 8:30:38 PM
to project-voldemort
We do run Voldemort RW clusters with 32 GB of heap, but we run Voldemort RO clusters with much less (around 4 GB of heap I think).

For Voldemort RO, we mostly just leverage the OS' page cache, so that should scale well even with large amounts of RAM.

For Voldemort RW, BDB may still benefit from the OS' page cache, but we haven't tested that assumption on large amounts of RAM.

Again, I encourage you to report back here and/or on your blog if you try it out (:

-F

Arunachalam

Aug 4, 2015, 8:36:52 PM
to project-...@googlegroups.com
As Félix rightly pointed out, for RO stores 256 GB of RAM should work, since the OS manages that memory rather than the JVM heap. My comment applies only to the RW servers.

Thanks,
Arun.

Miguel Ausó

Aug 5, 2015, 6:42:44 AM
to project-voldemort
Thanks for the reply

I think that with my resources I have to refocus on my first solution.

I have read the zone documentation several times, but I still do not understand replication between zones: https://github.com/voldemort/voldemort/wiki/Topology-awareness-capability

I have 4 servers with 5 LXC containers each (20 Voldemort instances), so I am considering two possible solutions.

To keep the example simple, each server has 3 partitions, and we assign them sequentially.


The configuration is the same for both examples.

Cluster config (cluster.xml)

<cluster>
  <name>cluster-with-two-zones</name>
  <zone>
    <zone-id>0</zone-id>
    <proximity-list>1</proximity-list>
  </zone>
  <zone>
    <zone-id>1</zone-id>
    <proximity-list>2</proximity-list>
  </zone>
  <zone>
    <zone-id>2</zone-id>
    <proximity-list>3</proximity-list>
  </zone>
  <zone>
    <zone-id>3</zone-id>
    <proximity-list>0</proximity-list>
  </zone>
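
(The <server> entries were left out of the snippet above. For illustration only, one node's declaration, which is what actually ties a set of partitions to a zone, might look like the sketch below; the host name is borrowed from later in this thread, and the ports and partition numbers are placeholders.)

  <!-- illustrative sketch: one <server> entry per Voldemort instance -->
  <server>
    <id>0</id>
    <host>server-095p1</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>0, 1, 2</partitions>
    <zone-id>0</zone-id>
  </server>
</cluster>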

Store config (stores.xml)

<replication-factor>4</replication-factor>
<zone-replication-factor>
    <replication-factor zone-id="0">1</replication-factor>
    <replication-factor zone-id="1">1</replication-factor>
    <replication-factor zone-id="2">1</replication-factor>
    <replication-factor zone-id="3">1</replication-factor>
</zone-replication-factor>
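
(For reference, the fragment above sits inside a full store definition. A hedged sketch along the lines of the topology-awareness wiki follows, with the store name, quorum values, and serializers as placeholders; it also shows the zone-routing strategy and the zone-count-reads/writes settings that zone routing uses.)

<store>
  <name>example-store</name>
  <persistence>bdb</persistence>
  <routing>client</routing>
  <routing-strategy>zone-routing</routing-strategy>
  <replication-factor>4</replication-factor>
  <zone-replication-factor>
      <replication-factor zone-id="0">1</replication-factor>
      <replication-factor zone-id="1">1</replication-factor>
      <replication-factor zone-id="2">1</replication-factor>
      <replication-factor zone-id="3">1</replication-factor>
  </zone-replication-factor>
  <zone-count-reads>0</zone-count-reads>
  <zone-count-writes>0</zone-count-writes>
  <required-reads>1</required-reads>
  <required-writes>1</required-writes>
  <key-serializer>
    <type>string</type>
  </key-serializer>
  <value-serializer>
    <type>string</type>
  </value-serializer>
</store>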

Example 1 (attached PNG)



In this example, each physical server with its LXC containers acts like a data center, and the LXC containers act as the servers.

We need the data written to partition 0 (which is in zone 0) to be replicated in another zone, so that if Server 0 breaks I don't lose data. Is that possible with this configuration?

The problem is that I don't understand whether the replica for zone X is placed on another server in the same zone or on a server in a different zone.

If the behavior is that the data ends up on another server in the same zone, then I am considering the following solution.


Example 2 (attached PNG)



If the data is replicated to another server within the same zone, then with this solution one server could break without losing data.

Which do you think is the best solution?
What is the correct behavior of zone replication?

Thanks!


Félix GV

Aug 5, 2015, 3:02:50 PM
to project-voldemort
Hi Miguel,

So, here are three solutions which I think you should try, in order of priority:
  1. Non-zoned, 1 node per host with small heap (ideally 32 GB, or maybe 64 GB max) and storage engine BDB. 
  2. Non-zoned, 1 node per host with very small heap (around 4 GB) and storage engine RocksDB.
  3. Zoned, multiple nodes per host with small heap (32 GB or so) and storage engine BDB, zones are 1:1 with hosts (like in your example 1), replication factor is equal to number of hosts (or maybe less than that if that's too wasteful...), zone-replication-factor is 1 (that's important, otherwise you'll get duplicated data within your hosts).
Here is the rationale for the above suggestions:

Suggestion #1: BDB is well tested, and using a single zone per data center is well tested as well. The only important thing with BDB is that your index (not your data) fits in memory, so that typically doesn't require huge amounts of RAM, though of course it varies based on your data set. The rest of your RAM will be left to the OS to use as page cache, which will speed up your IO (and you will likely need as much of that as you can get since you are running on SAS hard drives rather than SSDs). So this solution is actually very similar to what we've been running in prod for a long time, except with more page cache and SAS instead of SSDs. Not much risk. The only risk is that if your BDB index doesn't fit in the heap, you'll need to increase the heap, and it's not clear how far beyond 32 GB you can go before seeing GC issues.

Suggestion #2: RocksDB support was added a few months ago. Since that storage engine is a native process, it shouldn't suffer from GC problems. At least you wouldn't be doing something funky with the zones, so that's still kind of close to what we've been running in prod. The risk is higher for this suggestion because RocksDB support is still considered experimental. We haven't used it much yet (and not in prod), so you may hit a few road bumps.

Suggestion #3: If you're going to use multiple zones within a single data center, then I think you need to go with your example configuration 1. Example 2 will have very weird failure characteristics and wasteful replication, so I advise against it. The replication settings I suggested ensure that you do not get the same data twice on the same host, since that would be a waste of resources... In theory that should work, but I have a feeling you'll hit some road bumps here as well. Not to mention you'll need to deal with multiple Voldemort processes contending for shared disk and network IO. CPU cores and RAM are easy to isolate with LXC, but IO is typically not as straightforward. It will likely work at low throughput, but then again, probably any solution would work at low throughput (: ...

It's not clear to me what is less risky between suggestion #2 and #3, but I think both of those are definitely riskier than suggestion #1.

Whatever you decide, please do report back your findings!

My two cents.

-F



Miguel Ausó

Aug 6, 2015, 9:47:53 AM
to project-voldemort
Hi Felix,

Thanks for your reply.

I think I will try suggestion #3, because it is the solution that fits my hardware most closely.

When I have the production environment running, I will send you a report (post) :)

Thanks for your time.



Miguel Ausó

Sep 15, 2015, 10:06:30 AM
to project-voldemort
Hi Felix 

We are working on solution #3; we now have the production cluster and are running stress tests before going live.

We are seeing some strange behavior: when we check the store metadata, we get a lot of errors about partition 31. It seems the check looks for this partition on every server (we get this error for each server and each store), but that partition lives on node bcn1-cache-vold-096p4, as defined in cluster.xml.

When the check reaches node server-096p4 (the server that holds partition 31), we don't get an error.

On the other hand, when we run the stress test with voldemort-performance-tool.sh, we don't get any errors.

Any ideas?

The cluster.xml file is attached.

Error:

  Error doing sample key get from Store Searches Node Host 'server-096p3' : ID 6 Error Client accessing key belonging to partitions [31] not present at Node server-096p3 Id:6 in zone 0 partitionList:[43, 7, 14, 42]
       voldemort.store.InvalidMetadataException: Client accessing key belonging to partitions [31] not present at Node server-096p3 Id:6 in zone 0 partitionList:[43, 7, 14, 42]
               at sun.reflect.GeneratedConstructorAccessor2.newInstance(Unknown Source)
               at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
               at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
               at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:116)
               at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:103)
               at voldemort.store.ErrorCodeMapper.getError(ErrorCodeMapper.java:76)
               at voldemort.client.protocol.vold.VoldemortNativeClientRequestFormat.checkException(VoldemortNativeClientRequestFormat.java:298)
               at voldemort.client.protocol.vold.VoldemortNativeClientRequestFormat.readGetResponse(VoldemortNativeClientRequestFormat.java:114)
               at voldemort.store.socket.clientrequest.GetClientRequest.parseResponseInternal(GetClientRequest.java:63)
               at voldemort.store.socket.clientrequest.GetClientRequest.parseResponseInternal(GetClientRequest.java:31)
               at voldemort.store.socket.clientrequest.AbstractClientRequest.parseResponse(AbstractClientRequest.java:74)
               at voldemort.store.socket.clientrequest.BlockingClientRequest.parseResponse(BlockingClientRequest.java:74)
               at voldemort.store.socket.clientrequest.ClientRequestExecutor.read(ClientRequestExecutor.java:248)
               at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:105)
               at voldemort.common.nio.SelectorManager.run(SelectorManager.java:214)
               at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
               at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
               at java.lang.Thread.run(Thread.java:745)
       processing Host 'server-096p4' : ID 7


cluster.xml

Arunachalam

Sep 15, 2015, 10:13:07 AM
to project-...@googlegroups.com
Is the cluster.xml on all nodes in sync?

Thanks,
Arun.



Miguel Ausó

Sep 15, 2015, 11:10:33 AM
to project-voldemort
It seems OK:
bin/vadmin.sh  meta check cluster.xml -u tcp://127.0.0.1:6666
processing Host 'server-095p1' : ID 0
processing Host 'server-095p2' : ID 1
processing Host 'server-095p3' : ID 2
processing Host 'server-095p4' : ID 3
processing Host 'server-096p1' : ID 4
processing Host 'server-096p2' : ID 5
processing Host 'server-096p3' : ID 6
processing Host 'server-096p4' : ID 7
processing Host 'server-097p1' : ID 8
processing Host 'server-097p2' : ID 9
processing Host 'server-097p3' : ID 10
processing Host 'server-097p4' : ID 11
processing Host 'server-098p1' : ID 12
processing Host 'server-098p2' : ID 13
processing Host 'server-098p3' : ID 14
processing Host 'server-098p4' : ID 15
cluster.xml metadata check : PASSED

And:

bin/vadmin.sh  meta check-version -u tcp://127.0.0.1:6666
 Node : 0 Version : version() ts:1438175142424
 Node : 1 Version : version() ts:1438175142424
 Node : 2 Version : version() ts:1438175142424
 Node : 3 Version : version() ts:1438175142424
 Node : 4 Version : version() ts:1438175142424
 Node : 5 Version : version() ts:1438175142424
 Node : 6 Version : version() ts:1438175142424
 Node : 7 Version : version() ts:1438175142424
 Node : 8 Version : version() ts:1438175142424
 Node : 9 Version : version() ts:1438175142424
 Node : 10 Version : version() ts:1438175142424
 Node : 11 Version : version() ts:1438175142424
 Node : 12 Version : version() ts:1438175142424
 Node : 13 Version : version() ts:1438175142424
 Node : 14 Version : version() ts:1438175142424
 Node : 15 Version : version() ts:1438175142424
All the nodes have the same metadata versions.
avro-example=0
stores.xml=0
cluster.xml=0
test=0
