hardware recommendation for s3ql

Netflix Boundaries

May 7, 2013, 1:52:56 AM5/7/13
to s3...@googlegroups.com
Each time the mount.s3ql process reads or writes data during a sync run, it starts consuming >90% CPU and >3 GB of memory by itself.

Not counting other processes running on the system, this data was collected with 'top'. The duration of each spike depended on the size of the s3ql_data objects (chunks) and averaged between 2.5 and 3 seconds.

Quick calculation: for roughly 45 seconds of every minute, in bursts of up to 1 second each, CPU utilization is close to 100%. That's ~18 hours a day. Too much.

Based on these stats, what would be your hardware recommendation for a file system that will constantly be running rsync-type commands?

The current setup is an instance with 2 compute units and 3.75 GB of memory, and S3 is in the same region.

Brad Knowles

May 7, 2013, 10:05:20 AM5/7/13
to Netflix Boundaries, Brad Knowles, s3...@googlegroups.com
On May 6, 2013, at 11:52 PM, Netflix Boundaries <netflix.b...@gmail.com> wrote:

> Each time the mount.s3ql process reads or writes data during a sync run, it starts consuming >90% CPU and >3 GB of memory by itself.
>
> Not counting other processes running on the system, this data was collected with 'top'. The duration of each spike depended on the size of the s3ql_data objects (chunks) and averaged between 2.5 and 3 seconds.

Unfortunately, top is not a particularly good tool for gathering this kind of information. Instead, take a look at iostat and the other related "*stat" programs.
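
Something along these lines gives a much clearer picture (a rough sketch; pidstat and iostat come from the sysstat package, exact flags can vary between versions, and it assumes a single mount.s3ql process):

    # per-process CPU and memory for mount.s3ql, sampled every 5 seconds
    pidstat -u -r -p $(pgrep -f mount.s3ql) 5

    # per-device utilization, queue sizes and wait times (extended stats)
    iostat -x 5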

> Quick calculation: for roughly 45 seconds of every minute, in bursts of up to 1 second each, CPU utilization is close to 100%. That's ~18 hours a day. Too much.
>
> Based on these stats, what would be your hardware recommendation for a file system that will constantly be running rsync-type commands?

That is a ... suboptimal ... use of s3ql. Keep in mind that this is a single-threaded program. And during a sync operation, it will probably suck up as much RAM and CPU as you have available, assuming that more data has changed on disk since the last sync than there is RAM available to the process.

A normal rsync process would not be that different. It has to go out and calculate the hashes of all the new blocks and see what has changed, and then push those to the remote end.


Your use case sounds like Continuous Data Protection, or at least near-CDP. That's not a good model for s3ql. Consider that s3ql uses a back-end on S3 (or S3-compatible services), and that the primary goal for these kinds of services is maximum amounts of storage for minimum amounts of money -- which means that performance is not a primary concern. In fact, performance is not really any kind of concern with S3.

If you want CDP or near-CDP on an Amazon-hosted platform, I would recommend that you build your own service using striped EBS volumes as the backing store, and then a multi-threaded FUSE front-end that allows you to mount that backing store across the 'net. That would still have a huge bottleneck caused by your network latency to Amazon, but it would at least get you closer to near-CDP.
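
Purely as an illustration of the striping part (device names are hypothetical, and this says nothing about the multi-threaded FUSE layer you would still have to write):

    # stripe four attached EBS volumes into one RAID-0 device and put a filesystem on it
    mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/backing-store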

--
Brad Knowles <br...@shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>

Nikolaus Rath

May 7, 2013, 1:56:10 PM5/7/13
to s3...@googlegroups.com
On 05/06/2013 10:52 PM, Netflix Boundaries wrote:
> Each time mount.s3ql process attempts to write or read through a
> sync-data process it starts consuming >90% of CPU >3G Memory itself.

That's probably the lzma compression. If you want to use fewer resources,
you have to switch to e.g. bzip2.
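
For example, something like this when mounting (the exact option name and accepted values depend on your S3QL version, so check mount.s3ql --help; the bucket and mountpoint are placeholders):

    # remount with a cheaper compression algorithm than the default lzma
    mount.s3ql --compress bzip2 s3://your-bucket /mnt/s3ql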


Best,
-Nikolaus

Netflix Boundaries

May 8, 2013, 3:18:43 AM5/8/13
to s3...@googlegroups.com, Netflix Boundaries, Brad Knowles
Thanks Brad

Based on your statement, I think that "suboptimal use" may then apply to ANY backup-related kind of use. Encrypting or compressing sacrifices performance, so we would either need to give it more power (more spend) or drop features.

Since this hits hardest during the first full backup, it could also affect subsequent backups if a lot of data keeps flowing into the system, regardless of the recurrence window (every 24 h, every 12 h, every week, etc.). At that point, even if we skip files with the same size or timestamp, keeping the trade-off balanced enough to maintain performance means investing more resources (more money) in running multiple processes in parallel.

Which brings me to my next post: "s3ql_backup.sh - add multiple processes to the rsync command when trying to update a copy."
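
Roughly what I have in mind is something like this (just a sketch: the /data layout and the /mnt/s3ql/current target are made up, and it assumes directory names without spaces):

    # run one rsync per top-level directory, four at a time
    ls /data | xargs -P4 -I{} rsync -a --delete /data/{}/ /mnt/s3ql/current/{}/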

Brad Knowles

May 8, 2013, 9:57:50 AM5/8/13
to Netflix Boundaries, Brad Knowles, s3...@googlegroups.com
On May 8, 2013, at 1:18 AM, Netflix Boundaries <netflix.b...@gmail.com> wrote:

> Based on your statement, I think that "suboptimal use" may then apply to ANY backup-related kind of use. Encrypting or compressing sacrifices performance, so we would either need to give it more power (more spend) or drop features.

No, s3ql should be fine for at least some, if not many, backup scenarios. However, it does optimize for space and not performance, in part because the network is usually one of the biggest bottlenecks in most backup scenarios. It also optimizes for cost of storage, because anyone who is using S3 as their backup medium is someone who wants pretty much the lowest possible cost of storage and is willing to sacrifice most everything else to get it.

You are right that the initial backup will take a long time to execute, and will spend a lot of CPU doing compression. That is true for just about all remote backup solutions that I know of.


My reference to Continuous Data Protection (or Near-CDP) is more about the different kind of backup solution that you seem to be shooting for, where you are pretty much constantly backing up all the data, as soon as the data changes.

CDP and Near-CDP are pretty much the current state-of-the-art for high-end solutions, but they are mega-expensive in terms of the amount of network bandwidth you have to have available between the front-end and back-end systems, and they are mega-expensive in terms of the amount of hardware required to make this kind of thing happen.

See <http://en.wikipedia.org/wiki/Continuous_data_protection> for more information on CDP. For specific products, see <http://www.backupcentral.com/wiki/index.php/Continuous_Data_Protection_%28CDP%29_Software> and <http://www.backupcentral.com/wiki/index.php/Near-Continuous_Data_Protection_%28Near-CDP%29_Software>.


So long as you're running a more traditional backup scenario, and you don't have too much data that changes too quickly, I would imagine that s3ql should be sufficient.

Of course, serving as a filesystem-based backup solution is not the primary intention of s3ql, so there are going to be limits to what you can feasibly achieve in this space.

> Since this hits hardest during the first full backup, it could also affect subsequent backups if a lot of data keeps flowing into the system, regardless of the recurrence window (every 24 h, every 12 h, every week, etc.). At that point, even if we skip files with the same size or timestamp, keeping the trade-off balanced enough to maintain performance means investing more resources (more money) in running multiple processes in parallel.

Note that rsync is pretty good about detecting which files have or have not changed, and doing so with relatively minimal impact on the server and the client. It's the other things which suck up the CPU. The lzma compression algorithms will suck up more CPU than bzip, and bzip will suck up more than gzip. But in return, lzma usually gets just about the best compression currently available, and bzip gets better than gzip, but not as good as lzma.
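
If you want to see what rsync's quick check (size plus modification time) would actually transfer, a dry run along these lines is cheap to try (paths are placeholders):

    # -a archive mode, -i itemize changes, -n dry run: lists what would be copied and why
    rsync -ain /source/data/ /mnt/s3ql/current/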

But your available network bandwidth to your backup medium is usually the single biggest limiting factor for online backup solutions. In this case, that limit is on your side, between your server and S3.

Nikolaus Rath

May 8, 2013, 11:35:58 AM5/8/13
to s3...@googlegroups.com
On 05/08/2013 06:57 AM, Brad Knowles wrote:
> No, s3ql should be fine for at least some, if not many, backup scenarios. However, it does optimize for space and not performance, in part because the network is usually one of the biggest bottlenecks in most backup scenarios. It also optimizes for cost of storage, because anyone who is using S3 as their backup medium is someone who wants pretty much the lowest possible cost of storage and is willing to sacrifice most everything else to get it.

I don't think that's true. Amazon charges about 10 cents per GB per
month (US Standard, < 1 TB total). That's about $1.8k to store 500 GB
for 3 years. It'd be much cheaper to just buy a 500 GB hard disk for
$100.

In other words, if you really only care about price, S3 would be a
pretty bad choice.


Best,
Nikolaus

Netflix Boundaries

May 8, 2013, 6:32:47 PM5/8/13
to s3...@googlegroups.com, Netflix Boundaries, Brad Knowles
Thanks Brad for your input.

When you say that "it also optimizes for cost of storage, because anyone who is using S3 as their backup medium is someone who wants pretty much the lowest possible cost of storage and is willing to sacrifice most everything else to get it," I don't think that applies to everyone.

In our scenario, we don't see much of a difference in costs. For example, provisioned storage (EBS volumes at AWS) is $0.10 per GB/month, while S3 in US Virginia is $0.095 per GB/month (plus minor PUT/COPY/GET charges, etc., which in the end may work out to about the same as running EBS).

The big difference here is scalability. If we can scale at the same price, then that's our option. The limit of 1 TB per volume, plus the cost of building and maintaining arrays, quickly kills any feasibility case -- and that's without including the cost of implementing a really good backup solution that can guarantee consistency and restoration.

This is where s3ql plays a very good role. To be honest, the biggest fear is how far s3ql can keep growing, rather than how much storing the s3ql_data objects will cost. We expect it to grow beyond 8 million directory entries with the default chunk size. We could say it's stable enough, but we don't really know the limit, only the behavior -- and so far it's doing well. That's a risk we are willing to take, but isn't it the same risk we take daily when relying on any storage-type solution?

The clever part is to back up these objects frequently enough that we always have a working copy of the filesystem ready to use. Then we can treat that copy as our recovery point and simply repoint our paths to a new file system. In that case, at least, we'll know when it breaks.


For the rest, I totally agree. Moreover, that last limitation between S3 and the server is reduced by having them in the same network and region.


Our big concern now is how fast we can copy new files. We have tried bucket-to-bucket copies at the S3 level, and so far it's consistent but slow on days when the object count grows by 100k+.
We definitely need parallel processing, but we're not sure how to implement it at either the S3 level or the s3ql_backup script level.
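
The rough shape I've been considering, e.g. with s3cmd, is something like the following, though I don't know yet whether it's a sane approach (bucket names are placeholders, it assumes keys without spaces, and it doesn't handle deletions):

    # list all keys in the source bucket and copy them server-side, eight at a time
    s3cmd ls -r s3://source-bucket/ | awk '{print $4}' | sed 's|^s3://source-bucket/||' \
      | xargs -P8 -I{} s3cmd cp s3://source-bucket/{} s3://dest-bucket/{}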



Brad Knowles

May 8, 2013, 7:31:15 PM5/8/13
to Nikolaus Rath, Brad Knowles, s3...@googlegroups.com
On May 8, 2013, at 9:35 AM, Nikolaus Rath <Niko...@rath.org> wrote:

> I don't think that's true. Amazon charges about 10 cents per GB per
> month (US Standard, < 1 TB total). That's about $1.8k to store 500 GB
> for 3 years. It'd be much cheaper to just buy a 500 GB hard disk for
> $100.

With regards to online storage, they've still got one of the best cost/GB ratios. There are others that are cheaper, but not by that much. At least, not when you talk about large quantities of storage -- Dropbox may be "free" for 2GB, but it wouldn't be appropriate to compare that "free" price to the cost of S3 for 2GB.


Moreover, storage != hard drives. Specifically, storage >> hard drives.

The cost of bare hard drives does not include the cost of managing them. It does not include the cost of powering them. It does not include the cost of cooling them. It does not include the cost of providing redundancy.

Factor all those things in, and storage frequently costs ten times as much per GB as the raw hard drives do. Look at the price/performance comparisons of various vendors versus the "Backblaze" boxes (see <http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/> and related articles). Even their price of "... a full rack of Storage Pods with 4 TB drives is $0.47 per TB" takes into account only the hardware costs of the storage pods, and not what it takes to power, cool, or manage them.

> In other words, if you really only care about price, S3 would be a
> pretty bad choice.

In terms of online storage options, I disagree. There are others that are lower cost, but S3 is still aimed at the "bargain basement" contingent for this category. They're just not at the bottom of the barrel of the bargain basement.

Nikolaus Rath

May 8, 2013, 11:54:27 PM5/8/13
to s3...@googlegroups.com
True. But your original statement (as I understood it) was that people
using S3 for backups do it because of the price. With that I disagree
both in general and in particular: I'm using S3 for my backups, and I'm
doing so mostly for convenience rather than price. I could easily put my
<< 1 TB of data on a couple of USB disks, and I'm pretty sure even
taking into account power this would be much cheaper. I chose Amazon
instead, because it gives me much better durability, offsite storage,
and I don't have to juggle external disks around.


Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C