How does replication occur in Voldemort?

B H

Nov 23, 2016, 10:05:10 PM

to project-voldemort

I traced the code and logs but still don't get it. Does server node 1 communicate with server node 2 for replication?

Félix GV

Nov 23, 2016, 10:23:15 PM

to project-voldemort

No, replication is handled by the client in Voldemort.

--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-voldem...@googlegroups.com.
Visit this group at https://groups.google.com/group/project-voldemort.
For more options, visit https://groups.google.com/d/optout.

B H

Nov 23, 2016, 10:52:30 PM

to project-voldemort

How and when does that happen? Let's say the client puts a key-value pair in the store. It first establishes a connection to, say, node 0. How does node 1 get the replicated data?

Felix GV

Dec 1, 2016, 6:28:10 PM

to project-...@googlegroups.com

The client establishes connections to all nodes at start up time. A put is done in a blocking way against enough replicas to fulfill write consistency settings for that store, and on the rest of the replicas in a non-blocking (asynchronous) way.
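To make the put path above concrete, here is a minimal Python sketch of a client that sends the write to every replica but only blocks until enough of them have acknowledged it. This is not Voldemort's actual client code: `FakeNode`, `replicated_put` and the thread pool are hypothetical names invented for the illustration, and the nodes are plain in-memory objects.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeNode:
    """Hypothetical in-memory stand-in for a server node (not a Voldemort class)."""

    def __init__(self, node_id, fail=False):
        self.node_id = node_id
        self.fail = fail
        self.data = {}

    def put(self, key, value):
        if self.fail:
            raise IOError(f"node {self.node_id} unavailable")
        self.data[key] = value
        return self.node_id


def replicated_put(pool, replicas, key, value, required_writes):
    """Send the put to all replicas; return once required_writes have succeeded.

    Writes that have not finished when this function returns keep running
    in the pool, which plays the role of the asynchronous part of the put.
    """
    futures = [pool.submit(node.put, key, value) for node in replicas]
    acked = 0
    for future in as_completed(futures):
        if future.exception() is None:
            acked += 1
            if acked >= required_writes:
                return True
    # Every replica answered, but too few of them succeeded.
    return False
```

With three replicas and `required_writes=2`, the call returns `True` as soon as two nodes have stored the value, even if the third is down; it returns `False` only when fewer than two writes can succeed.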

--

Felix GV
Senior Software Engineer
Data Infrastructure
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

Arunachalam

Dec 1, 2016, 6:42:05 PM

to project-...@googlegroups.com

Felix is right.

Voldemort works in the following way. Given a key, it maps the key to a (virtual) partition. Each partition is hosted by multiple nodes (the number of replicas is specified in the store definition).

So, given a key, Voldemort identifies the nodes to distribute the key to. It then waits for the required number of writes (the required-writes value specified in the store definition) to respond successfully before returning success to the client.
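The key-to-partition-to-nodes mapping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: MD5 is used here as a convenient stable hash (no claim is made about which hash Voldemort uses), and `partition_for_key`, `replica_nodes` and the `partition_to_node` list are hypothetical names, not Voldemort APIs.

```python
import hashlib


def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a key to a virtual partition using a stable hash (MD5 is an assumption)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def replica_nodes(partition, partition_to_node, replication_factor):
    """Walk the partition ring from `partition`, collecting distinct owner nodes.

    partition_to_node[i] is the id of the node that owns partition i.
    """
    nodes = []
    ring_size = len(partition_to_node)
    for step in range(ring_size):
        node = partition_to_node[(partition + step) % ring_size]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == replication_factor:
            break
    return nodes
```

For a ring such as `[0, 1, 2, 0, 1, 2, 0, 1]` (eight partitions spread over three nodes), a key landing on partition 7 with a replication factor of 3 would be sent to nodes 1, 0 and 2, in that order.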

When the client reads inconsistent values (depending on required-reads), the client fixes the value on the server.
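A rough sketch of that read-side fix (often called read repair) is shown below. It is a simplification and not Voldemort's implementation: each replica is modelled as a plain dict, and versions are assumed to be non-negative integers where the larger one wins. `read_with_repair` is a hypothetical name.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, repair stale copies.

    Each replica is a dict mapping key -> (version, value); versions are
    assumed to be non-negative integers for this illustration.
    """
    found = [replica[key] for replica in replicas if key in replica]
    if not found:
        return None
    newest = max(found, key=lambda versioned: versioned[0])
    for replica in replicas:
        current_version = replica.get(key, (-1, None))[0]
        if current_version < newest[0]:
            replica[key] = newest  # overwrite the stale or missing copy
    return newest[1]
```

After one such read, every replica that was consulted holds the newest version, so later reads see consistent values.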

So if you emphasize consistent data, you should run the store with at least a replication factor of 3, 2 required writes, and 2 required reads.

Depending on your consistency needs, you can tune it to 2 preferred reads to optimize performance.
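The reasoning behind the 3/2/2 recommendation is the usual quorum rule: if required reads plus required writes exceed the replication factor, every read set overlaps every successful write set in at least one replica. A small check, with `quorum_overlaps` being a hypothetical helper name:

```python
def quorum_overlaps(replication_factor, required_reads, required_writes):
    """True when any read quorum must share at least one replica with any write quorum."""
    return required_reads + required_writes > replication_factor


# Replication factor 3 with 2 required writes and 2 required reads: 2 + 2 > 3.
print(quorum_overlaps(3, 2, 2))
# Replication factor 3 with 1 required write and 1 required read: 1 + 1 <= 3.
print(quorum_overlaps(3, 1, 1))
```

The first configuration guarantees that a read sees at least one replica holding the latest successful write; the second does not.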

Thanks,

Arun.

David Ongaro

Dec 2, 2016, 7:35:41 PM

to project-...@googlegroups.com

Something only vaguely related to the question, but which I have been wondering about for some time: can't we do something similar for read-only stores? It's great to have the BPRO feature, which saves many resources on Hadoop, but it's a pity that the fetcher still has to fetch all replicas from Hadoop. If the fetcher could fetch each file only once and then replicate it to the other nodes within the cluster, that would save a lot of network resources. Obviously, the available bandwidth within a rack is much greater than across DCs.

I realize that we are talking about different technical requirements here, since a whole file is something different from a single key/value update, but since the nodes can talk to each other, I think it should be feasible?
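As a back-of-the-envelope illustration of the saving being asked about: if every replica is fetched from Hadoop, the cross-DC traffic scales with the replication factor, whereas fetching each file once keeps it at one copy and moves the remaining copies over the intra-cluster network. The function name and the numbers below are hypothetical, chosen only to show the arithmetic.

```python
def wan_bytes_fetched(dataset_bytes, replication_factor, primary_only):
    """Bytes pulled from Hadoop for one push of a read-only store.

    dataset_bytes is the size of a single copy of the data set.
    """
    copies = 1 if primary_only else replication_factor
    return dataset_bytes * copies


# Hypothetical example: a 100 GB store with a replication factor of 2.
one_copy = 100 * 10**9
all_replicas = wan_bytes_fetched(one_copy, 2, primary_only=False)
primary_only = wan_bytes_fetched(one_copy, 2, primary_only=True)
print(all_replicas - primary_only)  # bytes that would no longer cross the WAN
```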

Arunachalam

Dec 2, 2016, 7:52:20 PM

to project-...@googlegroups.com

David,

We worked on it in our spare time, but could never find the time to make it official. The code exists with unit tests, but it was never tested at scale.

https://github.com/arunthirupathi/voldemort/tree/primaryPartitionFetch_2

The primary problem is that Voldemort's admin streaming is broken and does NIO in a weird way. This needs to be fixed before files can be streamed between nodes, and fixing it is a destabilizing change.

I could never find the time, and since we did not make this project official, there was no testing at scale. Hence this remains unmerged into master to date.

Thanks,

Arun.

Felix GV

Dec 2, 2016, 8:23:46 PM

to project-...@googlegroups.com

Yeah,

Arun's right. WAN bandwidth is an expensive resource whose use we would like to minimize, but we haven't gotten around to it yet. Arun's way of doing it is the "clean way", I think, but it's also a high touch / fairly risky change.

In my opinion, BPRO was a preliminary step in order to cleanly implement "fetch.primary.replica.only" (FPRO?), but Arun's branch was not tested in a real environment.

I think there's some guy from Apple who actually tried out the branch and said that "it worked", but of course that doesn't guarantee the absence of non-deterministic regressions hidden in there.

That being said, we may have a more low touch way of saving bandwidth coming up sometime next year, if the stars align ;) ...

Arunachalam

Dec 2, 2016, 8:37:20 PM

to project-...@googlegroups.com

I second Felix: it just has unit tests and was never tested in any real-world environment. So the chances of bugs/gaps are 99.9%.

Thanks,

Arun.

David Ongaro

Dec 4, 2016, 2:55:36 AM

to project-voldemort

I see. That's a pity. Can't you talk to your bosses to make this an official project? If you point out that they could save a lot of money they might listen...

Arunachalam

Dec 4, 2016, 1:13:28 PM

to project-...@googlegroups.com

The cost savings are not that large for our use case, since we already throttle the bandwidth used by the fetch jobs.

Thanks,

Arun.

Sam

Jul 6, 2018, 5:46:42 AM

to project-voldemort

Hi Felix,

I am upgrading our production Voldemort to 1.10.25. It would be very helpful if you could answer my questions below:

1. How does Voldemort clean stale data from the cache?

2. How does Voldemort do memory management?

3. Can we increase/decrease the number of partitions from 2048?

4. How do we configure the local cache and global cache in Voldemort?

5. How can we define the global replication factor and local replication factor?