Configurable BUFSIZE – good idea?

Daniel Jagszent

Jul 3, 2019, 8:00:27 PM
to s3ql
Hello Nikolaus,

Currently, S3QL uses a fixed buffer size of 64 KB. With that buffer
size I can get upload speeds of 100 MBit/s. Not that bad. But when I
increase that constant from 64 KB to e.g. 4 MB, the same setup can
saturate a 1 GBit/s network connection.
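
For reference, the change I benchmarked is literally just bumping that
constant (if I am not mistaken it lives in s3ql/common.py; treat the
exact location as an assumption on my part):

# s3ql/common.py -- location from memory
BUFSIZE = 64 * 1024        # current value: 64 KiB
# BUFSIZE = 4 * 1024**2    # 4 MiB is what saturated 1 GBit/s for me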

Do you have any reservations against making BUFSIZE configurable with
an extra argument to most of the command line tools, so that those who
want to can tailor the BUFSIZE to their system/use case?
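
Roughly, I am thinking of something like the following sketch (the
option name and the helper are placeholders I made up, not existing
S3QL code); the parsed value would then be handed down to wherever
BUFSIZE is used today:

import argparse

def add_bufsize_option(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper that each command-line tool would call while
    # building its argument parser.
    parser.add_argument(
        '--bufsize', type=int, default=64 * 1024, metavar='<bytes>',
        help='size of the read/write buffer in bytes (default: %(default)d)')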

I just wanted to ask in advance before going through the hassle of
creating a pull request.

Nikolaus Rath

Jul 4, 2019, 4:09:53 AM
to s3...@googlegroups.com
Thanks for checking! My immediate question is: is there a reason to not
just bump the hardcoded buffer size?

Adding an option typically means that 50% of the people go with the
default (even when it's sub-optimal), 40% pick a value that's even
worse, and 10% actually benefit. So I'd like to avoid options whenever
possible.

Did you check other buffer sizes or just 4 MB? There's probably an
optimum value that we could pick.


Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

Daniel Jagszent

Jul 4, 2019, 2:16:27 PM
to s3ql
Hi Nikolaus,

> [...] is there a reason to not just bump the hardcoded buffer size?
>
> Adding an option typically means that 50% of the people go with the
> default (even when it's sub-optimal), 40% pick a value that's even
> worse, and 10% actually benefit. So I'd like to avoid options whenever
> possible. [...]

If changing the hard-coded BUFSIZE is an option, that's what I would
prefer, too. That would increase the required memory for S3QL, though.
So when low-memory systems are of any concern, increasing the BUFSIZE
would be bad for them.

In this use case I use S3QL as a target for Bareos backups. Each
backup is a single file that can grow to hundreds of GB. Thus I chose
a max-obj-size of 3 GB, no compression (Bareos does that already) and
a cache size of 100 GB. The file system looks like this (the cache
gets dropped between backup windows):

Directory entries:    4735
Inodes:               4737
Data blocks:          6211
Total data size:      6.35 TB
After de-duplication: 6.35 TB (100.00% of total)
After compression:    6.35 TB (99.94% of total, 99.94% of de-duplicated)
Database size:        1.91 MiB (uncompressed)
Cache size:           0 bytes, 0 entries
Cache size (dirty):   0 bytes, 0 entries
Queued object removals: 0

So there are relatively few objects/data blocks, but they are about
1 GB each on average (6.35 TB / 6211 blocks ≈ 1 GB). This is quite a
different use case than the default max-obj-size of 10 MB.
Before bumping the BUFSIZE we definitely should benchmark with the
default max-obj-size, too.

Looking at contrib/benchmark.py, I can probably change this to also benchmark different BUFSIZEs for the upload so that we can get some data from different configurations.
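
Something along these lines would be the core of it (only a sketch:
where src_fh and dst_fh come from would be whatever upload path
contrib/benchmark.py already sets up, so those two handles are
assumptions here):

import time

CANDIDATE_BUFSIZES = [64 * 1024, 512 * 1024, 1024**2, 4 * 1024**2]

def copy_with_bufsize(src_fh, dst_fh, bufsize):
    # Copy src_fh to dst_fh in bufsize-sized chunks; return (bytes, seconds).
    copied = 0
    start = time.monotonic()
    while True:
        buf = src_fh.read(bufsize)
        if not buf:
            break
        dst_fh.write(buf)
        copied += len(buf)
    return copied, time.monotonic() - start

def report(src_fh, dst_fh, bufsize):
    copied, elapsed = copy_with_bufsize(src_fh, dst_fh, bufsize)
    print('%6d KiB buffer: %7.1f MiB/s'
          % (bufsize // 1024, copied / 1024**2 / elapsed))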


References:
https://github.com/python/cpython/commit/4f1903061877776973c1bbfadd3d3f146920856e increased buffer from 16KB to 64KB
https://blogs.blumetech.com/blumetechs-tech-blog/2011/05/faster-python-file-copy.html 10 MB buffer for large files

Nikolaus Rath

Jul 4, 2019, 3:20:12 PM
to s3...@googlegroups.com
On Jul 04 2019, Daniel Jagszent <dan...@jagszent.de> wrote:
> Hi Nikolaus,
>
>> [...] is there a reason to not just bump the hardcoded buffer size?
>>
>> Adding an option typically means that 50% of the people go with the
>> default (even when it's sub-optimal), 40% pick a value that's even
>> worse, and 10% actually benefit. So I'd like to avoid options whenever
>> possible.[...]
> If changing the hard-coded BUFSIZE is an option, that's what I would
> prefer, too. That would increase the required memory for S3QL, though.
> So when low-memory systems are of any concern, increasing the BUFSIZE
> would be bad for them.

I'm not too concerned about that. I think any system where a 4 MB
memory increase is a concern probably can't run S3QL anyway.

> In this use case I use S3QL as a target for Bareos backups. Each
> backup is a single file that can grow to hundreds of GB. Thus I chose
> a max-obj-size of 3 GB, no compression (Bareos does that already) and
> a cache size of 100 GB.
[...]
> So there are relatively few objects/data blocks, but they are about
> 1 GB each on average. This is quite a different use case than the
> default max-obj-size of 10 MB.
> Before bumping the BUFSIZE we definitely should benchmark with the
> default max-obj-size, too.

I don't see why any of this would affect the optimum buffer size (as
long as it's under the maximum object size), but sure, why not :-).

> Looking at contrib/benchmark.py, I can probably change this to also
> benchmark different BUFSIZEs for the upload so that we can get some
> data from different configurations.

Again, no objections, but I think this may be more work than
required. I'd start by just measuring the effects of a few different
hardcoded sizes (e.g. 512 kB, 1 MB, 2 MB, 4 MB, 8 MB).
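
For a first pass, something as simple as the following per candidate
size should already tell us a lot (paths are placeholders, and I am
assuming that "s3qlctrl flushcache" blocks until the dirty data has
actually been uploaded):

import shutil
import subprocess
import time

MOUNTPOINT = '/mnt/s3ql'        # placeholder
TESTFILE = '/tmp/testfile.bin'  # placeholder, e.g. a few GB of random data

start = time.monotonic()
shutil.copy(TESTFILE, MOUNTPOINT)
# Write out dirty cache entries so that the upload is part of the timing
subprocess.run(['s3qlctrl', 'flushcache', MOUNTPOINT], check=True)
print('%.1f s' % (time.monotonic() - start))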