Read-only BnP transfer rate


Apurva

Sep 8, 2016, 7:34:06 PM
to project-voldemort
Hi all,

We are running BnP jobs and observed file transfer throughput of less than 0.2 MB/s when fetching into a remote data center, whereas for the local DC we see ~10 MB/s (the expected throughput, since fetcher.max.bytes.per.sec=10485760). Is there any other parameter we need to tweak? One thing we observed: as the file size increased, we saw better throughput for the remote DC than for the local DC.
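
For reference, here is roughly how the throttle is configured on our Voldemort servers (we assume server.properties is the right place in your setup as well; values are in bytes per second):

  # server.properties (illustrative)
  # throttle read-only fetches from HDFS to ~10 MB/s
  fetcher.max.bytes.per.sec=10485760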

Thanks always for the help!

Best,
Apurva

Felix GV

Sep 9, 2016, 7:34:31 PM
to project-...@googlegroups.com
Hi Apurva,

There is definitely an overhead in having many small files, so your observations make sense. This overhead is exacerbated by long latencies (which would be the case when fetching remotely).

If you have small data sets, then I think the only thing you can do to reduce the number of files is to build a Voldemort cluster with fewer partitions. Since this is a cluster-level config, and not a store-level config, you need to be careful with how you set this config, unless you are willing to have a small cluster with few partitions for small stores, and a second cluster with more partitions for bigger stores.
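
For reference, the partition count is simply whatever you enumerate per node in cluster.xml when the cluster is created. A minimal sketch (hostnames, ports and the partition assignment are made up):

  <cluster>
    <name>small-cluster</name>
    <server>
      <id>0</id>
      <host>voldemort0.example.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <partitions>0, 2, 4, 6</partitions>
    </server>
    <server>
      <id>1</id>
      <host>voldemort1.example.com</host>
      <http-port>8081</http-port>
      <socket-port>6666</socket-port>
      <admin-port>6667</admin-port>
      <partitions>1, 3, 5, 7</partitions>
    </server>
  </cluster>

The total number of partitions (8 in this sketch) is fixed for the life of the cluster, which is why every store hosted on it inherits it.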

Besides that, there may be some OS / TCP-level settings that could be tweaked for long latency links. We, at LinkedIn, haven't had much opportunity to play around with this yet, so we can't offer much guidance, but we believe there may be some room for improvement on that front as well.
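
If you do want to experiment, the usual suspects for high-latency links are the TCP buffer sizes. A sketch of the kind of Linux settings people tune (illustrative values, not something we have validated for Voldemort):

  # /etc/sysctl.conf (illustrative)
  # allow larger TCP windows so a single stream can fill a high-latency link
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216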

-F

--
Felix GV
Senior Software Engineer
Data Infrastructure
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv



Apurva

Sep 13, 2016, 1:57:43 PM
to project-voldemort
Hi Felix,

Thanks for sharing your findings. Just curious: is there a setting on the Voldemort side to specify the size of the files generated on the nodes?

Thanks,
Apurva

Felix GV

Sep 13, 2016, 5:11:12 PM
to project-...@googlegroups.com
Yes, you can set the build.chunk.size config in the Build and Push job. However, you will never get fewer than one chunk per partition, so the number of partitions in the cluster is the biggest contributor to the number of files (at least for small stores).
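
For example, in the job's properties it could look something like this (only build.chunk.size is the point here; the other keys are illustrative and depend on your setup and BnP version):

  # BnP job properties (illustrative)
  type=java
  job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
  push.store.name=my-store
  push.cluster=tcp://voldemort.example.com:6666
  build.input.path=/user/me/my-store-input
  build.output.dir=/tmp/my-store-build
  # target chunk size in bytes (e.g. ~1 GB per chunk)
  build.chunk.size=1073741824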

--
Felix GV
Senior Software Engineer
Data Infrastructure
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv



Arunachalam

Sep 13, 2016, 5:29:28 PM
to project-...@googlegroups.com
How many partitions do you have in the cluster? For this reason (the file-count overhead), you should keep (number of partitions / number of servers) around 15 (or at most 20).

I assume you are running a 50-100 node cluster. If you are running a very big cluster, then too small a partition count means that when a node fails, its load is not evenly redistributed across the remaining nodes.
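
For example (rough numbers): with that rule of thumb, a 100-node cluster would be created with roughly 100 x 15 = 1500 to 100 x 20 = 2000 partitions in total.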

Thanks,
Arun.

David Ongaro

Oct 21, 2016, 12:56:12 AM
to project-voldemort
On Tuesday, September 13, 2016 at 2:29:28 PM UTC-7, Arun Thirupathi wrote:
How many partitions do you have in the cluster? For this reason (the file-count overhead), you should keep (number of partitions / number of servers) around 15 (or at most 20).

Is this true? We have 7000 partitions for our 15-node cluster. Is there any disadvantage to using this many partitions?

Currently we have 3 stores of around 3 to 4 TB each. For the smaller ones we get 4 chunks per partition at 80 to 120 MB each; for the bigger one, 2 chunks at about 280 MB each. That seems reasonable to me, but I don't really know the implications of exceeding your rule.

Félix GV

Oct 21, 2016, 1:51:14 AM
to project-voldemort
If you only have stores in the 3+TB range, then it looks like you're getting decent sizes per chunk, and you're probably fine.

The tradeoffs of a large number of partitions in RO & BnP are the following:

1) In its current implementation, BnP cannot run fewer than one reducer per partition. In your case, spinning up 7000 reducer tasks may be quite a big overhead, but if you're not feeling the pain, then there's probably no reason to worry too much about it.

2) In HDFS, it is generally preferable to have a few big files rather than many small ones. Since each partition produces at least two files (index + data), a high partition count causes extra load and memory pressure on the Hadoop NameNode (see the rough estimate below this list).

3) On the Voldemort servers, every file takes up a file descriptor. This is not a big deal, as you can just increase the OS limits, but it is yet another resource to keep an eye on.

4) The above factors result in a deployment that does not scale *down* gracefully for small store sizes if you have too many partitions. If you have use cases with small amounts of data (few records and/or small records), it's really not worth it to throw all of the above resources at the problem. Since Voldemort clusters have a statically defined number of partitions shared by all stores hosted on them, this is a bit inflexible if you want to host a wide range of store sizes. This is particularly important for us at LinkedIn: we have hundreds of stores, with new ones every week, and they range from a handful of records all the way up to billions of records, from KB to TB.
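
As a rough back-of-the-envelope estimate for points 2 and 3 (ignoring replication): number of files ≈ partitions x chunks per partition x 2, since every chunk is an index file plus a data file. So even a tiny store on a 7000-partition cluster produces on the order of 14,000 files in HDFS and, once fetched, roughly the same number of open file descriptors spread across the servers.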

On the other hand, the main advantage of over-partitioning is to accommodate future expansion. Arun's figure of 20 partitions per server allows you to grow maybe 4x comfortably (bringing you to 5 partitions per host, around which point skewed data distributions may cause hot nodes and become problematic) and theoretically up to 20x if your data is perfectly distributed. In our case, we have chosen to build new clusters when we fill up the old ones, rather than relying exclusively on expanding them, so this amount of over-partitioning works out fine for us.

The small number of nodes per cluster (20-60) in turn allows us to run with a replication factor of just 2 on most clusters, which would probably not be sufficient if we tried to deploy clusters with hundreds of nodes each, since correlated failures would be much more common in that scenario.

Another aspect is that for large stores, the more files you generate, the higher the throughput at which you can execute BnP. Fortunately, BnP supports chunks in order to take advantage of this: large stores are automatically (or via configs) chunked more aggressively, which can increase the parallelism of both the build and push phases.

Therefore, I think it is best to choose the number of partitions solely based on cluster size and future expandability, and to let chunking take care of increasing the throughput for large stores. But that doesn't mean 7000 partitions can't work. We simply haven't tried it (:

I hope that helps.

David Ongaro

Oct 21, 2016, 4:32:42 AM
to project-voldemort
Thanks for the comprehensive answer! Indeed, 7000 reducers is a little on the high side; I would prefer something more on the order of 2000. But even then we would still be way over Arun's suggested range of partitions/nodes = 15-20.

NameNode load is also always a concern, but in that regard we have other problem areas where we could do much better, so I don't worry about it so much here. And I guess if we cut down the number of partitions we would get more chunks, so the number of files shouldn't really change much.

But in fact we have a few other clusters with some smaller stores, down to 150 GB, where the number of partitions, at 6666, is only marginally smaller. I guess cutting it down to at most 1000 would help there. But it's always a hassle to change the cluster.xml of a cluster in production, because it also has to be kept in sync with new stores. So maybe just changing it when building a new cluster (like you suggest) would be the easiest. The only problem is that we don't do that often.

Felix GV

Oct 21, 2016, 2:25:17 PM
to project-...@googlegroups.com

On Fri, Oct 21, 2016 at 1:32 AM, David Ongaro <bitt...@gmail.com> wrote:
I guess if we cut down the number of partitions we would get more chunks, so the number of files shouldn't really change much.

For big stores it may not change much, but for small stores you could get away with fewer files. The problem with a large number of partitions and small stores is that Voldemort assumes there is always at least one chunk per partition, even if it's an empty one.


On Fri, Oct 21, 2016 at 1:32 AM, David Ongaro <bitt...@gmail.com> wrote:
But it's always a hassle to change the cluster.xml of a cluster in production because it also has to be synced with new stores. So maybe just changing it when building a new cluster (like you suggest) would be the easiest.

That's right. Fortunately, in the Voldemort Read-Only architecture, it is fairly easy to migrate stores from one cluster to another without any disruption to online traffic. Since the data is immutable, you can simply push to both your old and new cluster, and continue doing so until all clients have migrated (which should be a simple reconfiguration). Once all clients are migrated, then you can stop pushing to the old cluster.
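
As a sketch, that can be as simple as running two BnP jobs over the same input, differing only in the target cluster (the properties below are illustrative; depending on your BnP version you may also be able to list several cluster URLs in a single job, but please verify that against your version):

  # push-to-old.properties (illustrative)
  push.cluster=tcp://old-cluster.example.com:6666
  push.store.name=my-store
  build.input.path=/user/me/my-store-input

  # push-to-new.properties (identical except for the target cluster)
  push.cluster=tcp://new-cluster.example.com:6666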