How to check the data size in Voldemort?

xi...@tune.com

Feb 10, 2015, 1:36:33 PM

to project-...@googlegroups.com

Hi,

We have installed Voldemort on a three-node cluster and loaded several million key-value pairs into it. My questions are:

1. How can we check the data size in Voldemort so we know whether the system is filling up?

2. Since we have three nodes in the cluster, how can we distribute put and get requests automatically? Currently we use a single node's URL to connect to the cluster. Ideally, traffic would be spread across all three nodes.

3. How can we determine a node's memory requirement if we know the size of the data?

Thanks,

Xinyu

Brendan Harris (a.k.a. stotch on irc.oftc.net)

Feb 10, 2015, 4:17:24 PM

to project-...@googlegroups.com

Hi Xinyu,

We need a little more information from you before we can answer your questions:

- What version of voldemort are you running?

- What OS/distribution and version are you running voldemort on?

- What storage engine are your stores using (the <persistence> field in each store in stores.xml)?

- What kind of disks and disk configuration is the data being stored on?

- Are the disks being used for more than just voldemort storage?

- How much RAM is available per host?

- Can you please paste your voldemort properties, stores.xml and cluster.xml to this discussion?

Thanks,

Brendan

xi...@tune.com

Feb 11, 2015, 1:39:38 PM

to project-...@googlegroups.com

Thanks Brendan. Here are my answers to your questions:

- What version of voldemort are you running?

1.9.6

- What OS/distribution and version are you running voldemort on?

Linux 2.6.32-431.el6.x86_64 x86_64 (We are using Amazon EC2 instances)

- What storage engine are your stores using (the <persistence> field in each store in stores.xml)?

BDB at this time. We are still in the testing stage, so we will test other storage engines as well.

- What kind of disks and disk configuration is the data being stored on?

We plan to use general-purpose SSD on AWS.

- Are the disks being used for more than just voldemort storage?

We will have dedicated storage for Voldemort.

- How much RAM is available per host?

Currently we are testing AWS r3.xlarge, which has 30 GB of memory.

- Can you please paste your voldemort properties, stores.xml and cluster.xml to this discussion?

<!-- Note that "test" store requires 2 reads and writes,

so to use this store you must have both nodes started and running -->

<store>

<routing>client</routing>

<replication-factor>3</replication-factor>

<required-reads>1</required-reads>

<required-writes>1</required-writes>

<key-serializer>

<type>string</type>

</key-serializer>

<value-serializer>

<type>string</type>

</value-serializer>

<retention-days>1</retention-days>

</store>

cluster.xml

[root@ip-10-144-254-229 config]$ more cluster.xml

<cluster>

<name>mycluster</name>

<server>

<host>ec2-54-159-149-143.compute-1.amazonaws.com</host>

<http-port>8081</http-port>

<socket-port>6666</socket-port>

<admin-port>6667</admin-port>

</server>

<server>

<host>ec2-54-211-190-221.compute-1.amazonaws.com</host>

<http-port>8082</http-port>

<socket-port>6668</socket-port>

<admin-port>6669</admin-port>

</server>

<server>

<host>ec2-54-159-193-29.compute-1.amazonaws.com</host>

<http-port>8083</http-port>

<socket-port>6670</socket-port>

<admin-port>6671</admin-port>

</server>

</cluster>

Arunachalam

Feb 11, 2015, 1:51:01 PM

to project-...@googlegroups.com

Thanks for the info Xinyu.

1. How can we check the data size in Voldemort so we know whether the system is filling up?

Voldemort does not provide this information; it should come from your storage system. In our case, we simply measure disk usage, and our own internal alerting system raises an alert when the disks are 70% full.
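For readers who want to try the disk-based check described above, here is a minimal sketch in Python. It is not part of Voldemort and not the alerting system mentioned in this reply; the function names and the path in the example are hypothetical, and only the 70% threshold comes from the post.

```python
import shutil

# Fraction of disk used that triggers an alert (70%, as mentioned in the reply).
ALERT_THRESHOLD = 0.70


def used_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the fraction of the volume that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def should_alert(total_bytes: int, used_bytes: int,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the used fraction has reached the alert threshold."""
    return used_fraction(total_bytes, used_bytes) >= threshold


def check_path(path: str) -> bool:
    """Check the volume that contains `path` using shutil.disk_usage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return should_alert(usage.total, usage.used)


if __name__ == "__main__":
    # Hypothetical path: point this at the volume holding the Voldemort data directory.
    data_path = "/"
    state = "ALERT" if check_path(data_path) else "ok"
    print(f"{data_path}: {state}")
```

Run it from cron or a monitoring agent on each node; since the storage is dedicated to Voldemort, volume usage is a reasonable proxy for data size.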

2. Since we have three nodes in the cluster, how can we distribute put and get requests automatically? Currently we use a single node's URL to connect to the cluster. Ideally, traffic would be spread across all three nodes.

The URL you use is a bootstrap URL. It is used only when the client connects for the first time. After that, the client downloads the node information for the entire cluster and talks to each node individually. Generally, you put a VIP (virtual IP) in front of this bootstrap URL that round-robins between all the nodes. This way you eliminate a single point of failure during bootstrapping.
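As a rough illustration of the idea, the Python sketch below builds one bootstrap URL per node from the hosts and socket ports posted in the cluster.xml earlier in this thread, and rotates through them the way a round-robin VIP would. It assumes the usual tcp://host:socket-port form for bootstrap URLs; the helper names are hypothetical, and this is a simulation rather than real load-balancer configuration.

```python
from itertools import cycle

# Hosts and socket ports as posted in cluster.xml earlier in this thread.
NODES = [
    ("ec2-54-159-149-143.compute-1.amazonaws.com", 6666),
    ("ec2-54-211-190-221.compute-1.amazonaws.com", 6668),
    ("ec2-54-159-193-29.compute-1.amazonaws.com", 6670),
]


def bootstrap_urls(nodes):
    """Format each (host, socket_port) pair as a tcp:// bootstrap URL."""
    return [f"tcp://{host}:{port}" for host, port in nodes]


def round_robin(urls):
    """Return an endless iterator that hands out URLs in rotation."""
    return cycle(urls)


if __name__ == "__main__":
    picker = round_robin(bootstrap_urls(NODES))
    # Each new client would bootstrap from the next node in turn.
    for _ in range(4):
        print(next(picker))
```

The rotation only matters for the first connection. Once bootstrapped, the client routes requests to the individual nodes itself, so regular put and get traffic is already spread across the cluster.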

3. How can we determine a node's memory requirement if we know the size of the data?

This depends on your use case and requires runtime tuning. If you are using BDB, you need to ensure that the index B-tree stays in memory; otherwise your read performance will suffer. Tune it at runtime when you are at full load, but start with something reasonable.
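One way to pick a "reasonable" starting point is a back-of-envelope estimate of the index size. The Python sketch below is only that: the key count, average key size, and per-entry overhead are placeholder assumptions, not measured BDB figures, and the function names are hypothetical. BDB JE also ships a DbCacheSize utility that should give a more realistic number for your key and value sizes.

```python
def estimate_index_bytes(num_keys: int, avg_key_bytes: int,
                         per_entry_overhead_bytes: int) -> int:
    """Rough index size: every key costs its own bytes plus a fixed overhead."""
    return num_keys * (avg_key_bytes + per_entry_overhead_bytes)


def fits_in_cache(index_bytes: int, cache_bytes: int) -> bool:
    """True when the estimated index fits in the configured cache."""
    return index_bytes <= cache_bytes


if __name__ == "__main__":
    # All three inputs are assumptions; replace them with your own numbers.
    num_keys = 5_000_000           # "several million" keys, as in the question
    avg_key_bytes = 32             # assumed average key size
    per_entry_overhead_bytes = 64  # assumed B-tree overhead per entry

    index_bytes = estimate_index_bytes(num_keys, avg_key_bytes,
                                       per_entry_overhead_bytes)
    cache_bytes = 1024 ** 3        # assumed 1 GiB cache
    print(f"estimated index: {index_bytes / 1024 ** 2:.0f} MiB, "
          f"fits in cache: {fits_in_cache(index_bytes, cache_bytes)}")
```

With a replication factor of 3 on a three-node cluster, every node holds a copy of every key, so the full key count applies per node. Treat the result as a first guess and adjust it from the metrics you see under load.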

Thanks,

Arun.

xi...@tune.com

Feb 11, 2015, 2:04:35 PM

to project-...@googlegroups.com

Hi Arun,

Could you please provide an example of how to do this? Or is there any documentation I can follow?

2. Since we have three nodes in the cluster, how can we distribute put and get requests automatically? Currently we use a single node's URL to connect to the cluster. Ideally, traffic would be spread across all three nodes.

The URL you use is a bootstrap URL. It is used only when the client connects for the first time. After that, the client downloads the node information for the entire cluster and talks to each node individually. Generally, you put a VIP (virtual IP) in front of this bootstrap URL that round-robins between all the nodes. This way you eliminate a single point of failure during bootstrapping.

Thanks a lot,

Xinyu

Arunachalam

Feb 11, 2015, 2:10:31 PM

to project-...@googlegroups.com

That is a standard networking thing, nothing specific to Voldemort.

https://www.google.com/search?q=virtual+ip+load+balancing+linux

Also, by runtime tuning I meant: observe the servers while they are running, see what is going wrong based on the metrics, and tune the parameters accordingly.