Caching and block size in S3QL

Ivan Shapovalov

Jul 8, 2020, 5:02:46 AM
to s3...@googlegroups.com
Greetings.

Is it correct that s3ql will always download the whole block into the
local cache upon a read that intersects with this block?

If true, then how scalable is s3ql with respect to number of blocks in
the filesystem? That is, how far can I realistically reduce the block
size if my dataset is, say, 10-20 TB?

Basically, I'm trying to optimize for random reads.

--
Ivan Shapovalov / intelfx /

Cliff Stanford

Jul 8, 2020, 5:43:54 AM
to s3...@googlegroups.com
On 08/07/2020 12:02, Ivan Shapovalov wrote:

> Is it correct that s3ql will always download the whole block into the
> local cache upon a read that intersects with this block?

A block is either a single file or part of a file, if the file exceeds
the maximum block size. There are never multiple files in a block and
blocks are variable size, up to the maximum.

> If true, then how scalable is s3ql with respect to number of blocks in
> the filesystem? That is, how far can I realistically reduce the block
> size if my dataset is, say, 10-20 TB?
>
> Basically, I'm trying to optimize for random reads.

I don't think that reducing the maximum block size will improve things
for you.

Regards
Cliff.


--
Cliff Stanford
London: +44 20 0222 1666 Swansea: +44 1792 469666
Spain: +34 603 777 666 Estonia: +372 5308 9666
UK Mobile: +44 7973 616 666

Ivan Shapovalov

Jul 8, 2020, 5:47:36 AM
to Cliff Stanford, s3...@googlegroups.com
On 2020-07-08 at 12:43 +0300, Cliff Stanford wrote:
> On 08/07/2020 12:02, Ivan Shapovalov wrote:
>
> > Is it correct that s3ql will always download the whole block into the
> > local cache upon a read that intersects with this block?
>
> A block is either a single file or part of a file, if the file exceeds
> the maximum block size. There are never multiple files in a block and
> blocks are variable size, up to the maximum.

That's clear enough. What I'm asking is this: for example, if I have a
100 MiB block size set at mkfs time and a 1 GiB file on the filesystem
(which is thus broken into 10 blocks; let's ignore dedup), and I'm
read()ing 10 MiB from the middle of the file, how much data will s3ql
download?
10 MiB, or 100-200 MiB?

>
> > If true, then how scalable is s3ql with respect to number of blocks in
> > the filesystem? That is, how far can I realistically reduce the block
> > size if my dataset is, say, 10-20 TB?
> >
> > Basically, I'm trying to optimize for random reads.
>
> I don't think that reducing the maximum block size will improve things
> for you.

If what I conjectured above is true, then reducing the block size to the
average I/O size will obviously reduce download overhead, no?

Cliff Stanford

Jul 8, 2020, 5:57:44 AM
to s3...@googlegroups.com
On 08/07/2020 12:47, Ivan Shapovalov wrote:

> That's clear enough. What I'm asking is this: for example, if I have a
> 100 MiB block size set at mkfs time and a 1 GiB file on the filesystem
> (which is thus broken into 10 blocks; let's ignore dedup), and I'm
> read()ing 10 MiB from the middle of the file, how much data will s3ql
> download?
> 10 MiB, or 100-200 MiB?

Now you're beyond my knowledge. I had misunderstood your question.
Hopefully someone who knows better than I do will be able to answer.

Daniel Jagszent

Jul 8, 2020, 8:54:59 AM
to s3...@googlegroups.com
> Is it correct that s3ql will always download the whole block into the
> local cache upon a read that intersects with this block?

Yes.

> If true, then how scalable is s3ql with respect to number of blocks in
> the filesystem? That is, how far can I realistically reduce the block
> size if my dataset is, say, 10-20 TB?
>
> Basically, I'm trying to optimize for random reads.

How many files/inodes does your dataset have? What is the average file size?

You should also take your connection speed into account. If you are on a 1 GB/s internet connection, downloading 10 MiB blocks (the default max block size) is probably fine. If you only have a 1 MB/s connection, your max block size should be smaller.
But keep in mind that not only the raw download speed is relevant, but also the setup cost of each download request: the object storage backend needs to authenticate your request and look up your object, and you may also have DNS/TLS/TCP slow-start overhead. And you might have to pay for each request.

S3QL uses an SQLite database to keep track of everything. A smaller block size means more blocks for S3QL to keep track of, and therefore a bigger database.
That SQLite database can get big. A compressed copy of the database gets stored on the object storage as a backup. Currently there is a limitation that this backup can only be 5 GiB (the maximum object size of almost all object storage providers). If you have many blocks and many inodes you can reach this limit (search this list; one or two folks have had this problem) and get into real trouble.

I probably would not choose a max block size below 10 MiB. I have some S3QL file systems with a few big files (Bareos "tape" backups); for these file systems I have increased the max block size to 300 MiB, but those files only get accessed sequentially and the VMs running these file systems are on 1 GB/s+ internet connections.

Ivan Shapovalov

Jul 8, 2020, 9:46:51 AM
to Daniel Jagszent, s3...@googlegroups.com
On 2020-07-08 at 14:54 +0200, Daniel Jagszent wrote:
> > Is it correct that s3ql will always download the whole block into the
> > local cache upon a read that intersects with this block?
> Yes.
> > If true, then how scalable is s3ql with respect to number of blocks in
> > the filesystem? That is, how far can I realistically reduce the block
> > size if my dataset is, say, 10-20 TB?
> >
> > Basically, I'm trying to optimize for random reads.
> How many files/inodes does your dataset have? What is the average file
> size?

This specific filesystem is used to store a borg repository.
All files are ~500 MiB in size. Consequently, I expect ~20-40k of these
files.

I have not (yet) profiled the file access patterns exactly, but I know
that all new writes are strictly sequential and files are never
rewritten, but accessing a borg repository causes many small random
reads with no discernible pattern.

(it's kinda ironic that I use a copy-on-write chunking globally
deduplicating filesystem to store a copy-on-write chunking globally
deduplicating content archiving repository, but this is the only S3
filesystem that works satisfactorily...)
Has anybody tried to improve s3ql's caching mechanism to allow partial
download of blocks?

David Gasaway

Jul 8, 2020, 6:39:44 PM
to int...@intelfx.name, Daniel Jagszent, s3...@googlegroups.com
On Wed, Jul 8, 2020 at 6:46 AM Ivan Shapovalov <int...@intelfx.name> wrote:
 
> This specific filesystem is used to store a borg repository.

I currently use borg + s3ql myself. I don't think I'd recommend it. In my experience, borg check operations don't fare well in this scenario (as in, they download the entire repository). I plan to switch to s3ql_backup.py + s3ql, but the prospect of the uploads needed to start from scratch is daunting.

--
-:-:- David K. Gasaway
-:-:- Email: da...@gasaway.org

Ivan Shapovalov

Jul 8, 2020, 7:59:38 PM
to David Gasaway, s3...@googlegroups.com
On 2020-07-08 at 15:39 -0700, David Gasaway wrote:
> On Wed, Jul 8, 2020 at 6:46 AM Ivan Shapovalov <int...@intelfx.name> wrote:
>
> > This specific filesystem is used to store a borg repository.
>
> I currently use borg + s3ql myself. I don't think I'd recommend it. In
> my experience, borg check operations don't fare well in this scenario
> (as in, they download the entire repository). I plan to switch to
> s3ql_backup.py + s3ql, but the prospect of the uploads needed to start
> from scratch is daunting.

If I understand you right, then a `borg check` _has_ to download the
entire repository, because, well, it's an integrity check :)

But yeah, I don't hold any illusions — though I don't see much of a
choice either. I have an 8+ TB dataset containing some VMs, some random
assorted small files which come and go, and some user shares with
arbitrary data, and I need to snapshot+backup all that stuff for
point-in-time recovery.

Using S3QL's built-in quasi-snapshotting is a possibility, but I'm not
sure if rsync is going to be at all efficient with delta transfers of
VM images.

I guess I'll try my luck with various block sizes, see how large the
metadata grows and how efficient borg reads will be. I still need to
profile borg I/O patterns rigorously, but I believe that small random
reads are the main issue. Worst case I guess I'll have to write a
custom caching layer for partial blocks, maybe borg-aware.

David Gasaway

Jul 8, 2020, 8:24:05 PM
to s3...@googlegroups.com
On Wed, Jul 8, 2020 at 4:59 PM Ivan Shapovalov <int...@intelfx.name> wrote:
> On 2020-07-08 at 15:39 -0700, David Gasaway wrote:
> > On Wed, Jul 8, 2020 at 6:46 AM Ivan Shapovalov <int...@intelfx.name> wrote:
> >
> > > This specific filesystem is used to store a borg repository.
> >
> > I currently use borg + s3ql myself. I don't think I'd recommend it.
> > In my experience, borg check operations don't fare well in this
> > scenario (as in, they download the entire repository). I plan to
> > switch to s3ql_backup.py + s3ql, but the prospect of the uploads
> > needed to start from scratch is daunting.
>
> If I understand you right, then a `borg check` _has_ to download the
> entire repository, because, well, it's an integrity check :)

I'm saying that borg makes certain assumptions about the repository storage space that don't necessarily mesh with the class of filesystems to which s3ql belongs.  So, conditions which might be relatively unremarkable for s3ql can require a full borg check to remedy.  Something may go wrong which really only impacts a single archive, a small amount of data when deduplicated, but a full repository repair is required to get borg to even recognize the issue.  Find my posts on the borg mailing list if you're curious.

Daniel Jagszent

Jul 8, 2020, 10:25:52 PM
to int...@intelfx.name, s3...@googlegroups.com
> [...] This specific filesystem is used to store a borg repository.
> All files are ~500 MiB in size. Consequently, I expect ~20-40k of these
> files. [...]

This should not be a problem regarding the S3QL database size. (I suspect the uncompressed size of the DB would be < 100 MiB.)


> I have not (yet) profiled the file access patterns exactly, but I know
> that all new writes are strictly sequential and files are never
> rewritten, but accessing a borg repository causes many small random
> reads with no discernible pattern.

I do not know the specifics of the content-defined chunking borg uses, but if it is similar to restic's implementation ( https://godoc.org/github.com/restic/chunker ) then chunks will be between 512 KiB and 8 MiB. Let's say compression can reduce that by a factor of two, so the chunks borg needs to access are between ~256 KiB and ~4 MiB. A max S3QL block size of 5 MiB instead of the default 10 MiB might then be better. Since your file system has relatively few inodes (~40k inodes, ~40k names, 4M blocks) this should be OK for the S3QL database.

Besides the max block size, what really, really would improve random read/write performance is a dedicated SSD for the S3QL cache.


> [...] Has anybody tried to improve s3ql's caching mechanism to allow
> partial download of blocks?

Not that I know of. Any such implementation should be rock solid with regard to data integrity (which has priority over performance for S3QL, AFAIK) and should survive an OS crash at any point in time.
Since blocks are (optionally) compressed and encrypted, it is not that easy to discern the required byte range to retrieve from the object storage…


Ivan Shapovalov

Jul 9, 2020, 7:05:24 AM
to Daniel Jagszent, s3...@googlegroups.com
On 2020-07-09 at 04:25 +0200, Daniel Jagszent wrote:
> > [...] This specific filesystem is used to store a borg repository.
> > All files are ~500 MiB in size. Consequently, I expect ~20-40k of
> > these files. [...]
>
> This should not be a problem regarding the S3QL database size. (I
> suspect the uncompressed size of the DB would be < 100 MiB.)
>
> > I have not (yet) profiled the file access patterns exactly, but I
> > know that all new writes are strictly sequential and files are never
> > rewritten, but accessing a borg repository causes many small random
> > reads with no discernible pattern.
>
> I do not know the specifics of the content-defined chunking borg uses,
> but if it is similar to restic's implementation (
> https://godoc.org/github.com/restic/chunker ) then chunks will be
> between 512 KiB and 8 MiB. Let's say compression can reduce that by a
> factor of two, so the chunks borg needs to access are between ~256 KiB
> and ~4 MiB. A max S3QL block size of 5 MiB instead of the default 10
> MiB might then be better. Since your file system has relatively few
> inodes (~40k inodes, ~40k names, 4M blocks) this should be OK for the
> S3QL database.

That's true for data chunks, but metadata chunks are smaller. I still
have no hard profile data (bpf doesn't like me), but plain stracing
shows that pruning/deleting old archives with borg (a metadata-heavy
operation) performs a scatter of 128 KiB reads every 1-10 MiB, which
naturally results in huge read amplification. Pruning two archives from
a test 600 GiB repository has just crossed 100 GiB total I/O.

Not sure if this can be solved by reducing the S3QL block size...

>
> Besides max block size what really really would improve the random
> read/write performance is a dedicated SSD for the S3QL cache.

Yes, I do have an SSD for the cache — not dedicated, though.

Ivan Shapovalov

Jul 9, 2020, 7:06:11 AM
to Daniel Jagszent, s3...@googlegroups.com
On 2020-07-09 at 04:25 +0200, Daniel Jagszent wrote:
> > [...] Has anybody tried to improve s3ql's caching mechanism to allow
> > partial download of blocks?
> Not that I know of. Any such implementation should be rock solid with
> regard to data integrity (which has priority over performance for S3QL,
> AFAIK) and should survive an OS crash at any point in time.
> Since blocks are (optionally) compressed and encrypted, it is not that
> easy to discern the required byte range to retrieve from the object
> storage…

Ah yes, compression and probably encryption will indeed preclude any
sort of partial block caching. An implementation will have to be
limited to plain uncompressed blocks, which is okay for my use-case
though (borg provides its own encryption and compression anyway).

Regarding stability, I don't see why such a cache would be inherently
unstable -- it adds complexity, yes, but so does everything else.

I was thinking of doing something on top of sparse files:

- every partially cached block gets its own sparse file
- every sparse file gets a map file that records which data is present,
in the form of a list of ranges
- sparse files get a different naming scheme (e. g. ${blockid}.partial)
to prevent S3QL from ever mistaking a partially downloaded block for a
fully cached one, and map files likewise (e. g. ${blockid}.map)
- application reads are compared against the existing map to determine
if new data has to be downloaded
- backend reads are probably aligned down and rounded up to a minimal
viable I/O size (configurable), then passed through as partial downloads
- the sparse file is allocated if it does not exist (open, truncate),
the range is downloaded into the file as is (seek, write) and the file
is fsynced
- a new map file is created (${blockid}.map.new), the new ranges are
serialized and the map file is atomically renamed to ${blockid}.map

Naturally, there has to be some logic to determine if it is worthwhile
to perform a partial download at all, as well as logic to promote
partial blocks to full blocks.
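
To make the bookkeeping concrete, here is a minimal sketch of the
range-map side of that idea (hypothetical names; the JSON map format and
the download_range() callback are my own assumptions, this is not actual
S3QL code):

    import json, os

    def _load_map(path):
        """Read the list of (start, end) ranges already present, if any."""
        try:
            with open(path) as fh:
                return [tuple(r) for r in json.load(fh)]
        except FileNotFoundError:
            return []

    def _save_map(path, ranges):
        """Atomically replace the map file (write .new, fsync, rename)."""
        tmp = path + '.new'
        with open(tmp, 'w') as fh:
            json.dump(sorted(ranges), fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.rename(tmp, path)

    def missing_ranges(ranges, start, end):
        """Return the parts of [start, end) not yet covered by `ranges`."""
        gaps, pos = [], start
        for r_start, r_end in sorted(ranges):
            if r_end <= pos:
                continue
            if r_start >= end:
                break
            if r_start > pos:
                gaps.append((pos, min(r_start, end)))
            pos = max(pos, r_end)
            if pos >= end:
                break
        if pos < end:
            gaps.append((pos, end))
        return gaps

    def read_partial(blockid, start, end, cachedir, download_range):
        """Serve a read from ${blockid}.partial, fetching only missing ranges.

        download_range(blockid, start, end) is assumed to issue a ranged GET
        against the backend and return the bytes.
        """
        data_path = os.path.join(cachedir, f'{blockid}.partial')
        map_path = os.path.join(cachedir, f'{blockid}.map')
        ranges = _load_map(map_path)
        fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o600)
        with os.fdopen(fd, 'r+b') as fh:  # file stays sparse until written
            for g_start, g_end in missing_ranges(ranges, start, end):
                chunk = download_range(blockid, g_start, g_end)
                fh.seek(g_start)
                fh.write(chunk)
                ranges.append((g_start, g_end))
            fh.flush()
            os.fsync(fh.fileno())
            _save_map(map_path, ranges)  # map updated only after data is durable
            fh.seek(start)
            return fh.read(end - start)

The point of the ordering is that a crash can at worst leave data in the
sparse file that the map does not yet know about, never the other way
around.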

Does this all sound plausible?

Daniel Jagszent

Jul 10, 2020, 9:41:36 AM
to s3...@googlegroups.com
> [...] Not sure if this can be solved by reducing the S3QL block size...

I do not think so – you would need to reduce the max block size to 128 KiB – do not do that!

Your best option probably will be to use a huge S3QL cache (on SSD) and try to keep that cache for as long as you can (see the "--keep-cache" option: https://www.rath.org/s3ql-docs/mount.html#mounting).
Eventually your metadata will be all cached and you will have near-local performance.


Daniel Jagszent

Jul 10, 2020, 9:41:39 AM
to s3...@googlegroups.com



> Ah yes, compression and probably encryption will indeed preclude any
> sort of partial block caching. An implementation will have to be
> limited to plain uncompressed blocks, which is okay for my use-case
> though (borg provides its own encryption and compression anyway).
> [...]
Compression and encryption are integral parts of S3QL and I would argue that disabling them is only an edge case.
I might be wrong but I think Nikolaus (maintainer of S3QL) will not accept such a huge change into S3QL that is only beneficial for an edge case.


Since you do not need compression and encryption (handled by borg anyway), have you tried something with fewer abstractions?
Does https://github.com/kahing/goofys work for you? (Maybe stack a https://github.com/kahing/catfs on top.)



Ivan Shapovalov

Jul 10, 2020, 9:47:29 AM
to Daniel Jagszent, s3...@googlegroups.com
On 2020-07-10 at 15:41 +0200, Daniel Jagszent wrote:
> > [...] Not sure if this can be solved with reducing S3QL block
> > size...
> I do not think so – you would need to reduce the max block size to
> 128KiB – do not do that!

Indeed, I understand it won't fly.

>
> Your best option probably will be to use a huge S3QL cache (on SSD) and
> try to keep that cache for as long as you can (see the "--keep-cache"
> option: https://www.rath.org/s3ql-docs/mount.html#mounting).
> Eventually your metadata will be all cached and you will have
> near-local performance.

As I see it, borg apparently does not keep its metadata tightly grouped
-- metadata chunks may end up in completely random segments. I have 100
GiB allocated for the s3ql cache and it does not appear to help much.

Cliff Stanford

Jul 10, 2020, 10:06:40 AM
to s3...@googlegroups.com
On 10/07/2020 16:41, Daniel Jagszent wrote:

> Compression and encryption are integral parts of S3QL and I would argue
> that disabling them is only an edge case.
> I might be wrong but I think Nikolaus (maintainer of S3QL) will not
> accept such a huge change into S3QL that is only beneficial for an edge
> case.

You can disable encryption with "--plain" at mkfs time. You can disable
compression with "--compress none" at mount time.

Not sure how it helps though.

Ivan Shapovalov

Jul 10, 2020, 10:08:32 AM
to Daniel Jagszent, s3...@googlegroups.com
On 2020-07-10 at 15:41 +0200, Daniel Jagszent wrote:
>
>
> > Ah yes, compression and probably encryption will indeed preclude any
> > sort of partial block caching. An implementation will have to be
> > limited to plain uncompressed blocks, which is okay for my use-case
> > though (borg provides its own encryption and compression anyway).
> > [...]
> Compression and encryption are integral parts of S3QL and I would argue
> that disabling them is only an edge case.
> I might be wrong but I think Nikolaus (maintainer of S3QL) will not
> accept such a huge change into S3QL that is only beneficial for an edge
> case.

I see, but I was rather asking about the technical side of things,
whether it sounds sane or I'm missing something.

Provided I do work on partial cache in the end, if Nikolaus decides to
accept that work into upstream -- great, if not -- I will carry it in a
downstream fork.

(I will probably need a fork anyway, as Nikolaus has apparently
rejected a specific optimization in the B2 backend, absence of which
makes my s3ql hit a certain API rate limit very often.)

>
>
> Since you do not need compression and encryption (handled by borg
> anyways) have you tried something with less abstractions?
> Does https://github.com/kahing/goofys work for you? (maybe stack a
> https://github.com/kahing/catfs on top)

Yes, but no. S3QL is the only S3 filesystem I'm aware of that provides
consistency guarantees in some form. goofys appears to work, but it
only works by virtue of having a single thread and a single backend
connection (consequently, it ends up being very slow), whereas catfs
has too many "will eat your data" disclaimers for my liking.

Nikolaus Rath

Jul 10, 2020, 2:51:36 PM
to s3...@googlegroups.com
On Jul 10 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> (I will probably need a fork anyway, as Nikolaus has apparently
> rejected a specific optimization in the B2 backend, absence of which
> makes my s3ql hit a certain API rate limit very often.)

Hu? S3QL does not have a B2 backend at all, so I don't think I could
have rejected optimizations for it.

Best,
Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

Nikolaus Rath

Jul 10, 2020, 2:54:54 PM
to s3...@googlegroups.com
On Jul 10 2020, Daniel Jagszent <dan...@jagszent.de> wrote:
>> Ah yes, compression and probably encryption will indeed preclude any
>> sort of partial block caching. An implementation will have to be
>> limited to plain uncompressed blocks, which is okay for my use-case
>> though (borg provides its own encryption and compression anyway).
>> [...]
> Compression and encryption are integral parts of S3QL and I would argue
> that disabling them is only an edge case.

If I were to write S3QL from scratch, I would probably not support this
at all, right. However, since the feature is present, I think we ought
to consider it fully supported ("edge case" makes it sound as if this
isn't the case).


> I might be wrong but I think Nikolaus (maintainer of S3QL) will not
> accept such a huge change into S3QL that is only beneficial for an edge
> case.

Never say never, but the bar is certainly high here. I think there are
more promising avenues to explore - e.g. storing the
compressed/uncompressed offset mapping to make partial retrieval work
for all cases.

Best,
-Nikolaus

Ivan “intelfx” Shapovalov

Jul 10, 2020, 3:06:12 PM
to Nikolaus Rath, s3...@googlegroups.com


On 10 July 2020, at 21:51, Nikolaus Rath <Niko...@rath.org> wrote:

> On Jul 10 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> > (I will probably need a fork anyway, as Nikolaus has apparently
> > rejected a specific optimization in the B2 backend, absence of which
> > makes my s3ql hit a certain API rate limit very often.)
>
> Hu? S3QL does not have a B2 backend at all, so I don't think I could
> have rejected optimizations for it.

Then how am I using it? :)
https://github.com/s3ql/s3ql/pull/116

--
Ivan Shapovalov / intelfx /

(Sent from a phone. Havoc may be wreaked on the formatting.)

Nikolaus Rath

Jul 10, 2020, 3:09:20 PM
to s3...@googlegroups.com
On Jul 10 2020, Ivan “intelfx” Shapovalov <int...@intelfx.name> wrote:
>> On 10 July 2020, at 21:51, Nikolaus Rath <Niko...@rath.org> wrote:
>>
>> On Jul 10 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
>>> (I will probably need a fork anyway, as Nikolaus has apparently
>>> rejected a specific optimization in the B2 backend, absence of which
>>> makes my s3ql hit a certain API rate limit very often.)
>>
>> Hu? S3QL does not have a B2 backend at all, so I don't think I could
>> have rejected optimizations for it.
>
> Then how am I using it? :)
> https://github.com/s3ql/s3ql/pull/116

I stand corrected. I guess I didn't do a release since then, so the
documentation isn't updated yet. Apologies.

That said, my point about the optimization stands: I do not remember
rejecting anything here. Do you have a link for that too? :-)

Best,
-Nikolaus

Ivan “intelfx” Shapovalov

Jul 10, 2020, 3:24:18 PM
to Nikolaus Rath, s3...@googlegroups.com

On 10 July 2020, at 22:09, Nikolaus Rath <Niko...@rath.org> wrote:

> On Jul 10 2020, Ivan “intelfx” Shapovalov <int...@intelfx.name> wrote:
> > On 10 July 2020, at 21:51, Nikolaus Rath <Niko...@rath.org> wrote:
> >
> > > On Jul 10 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> > > > (I will probably need a fork anyway, as Nikolaus has apparently
> > > > rejected a specific optimization in the B2 backend, absence of which
> > > > makes my s3ql hit a certain API rate limit very often.)
> > >
> > > Hu? S3QL does not have a B2 backend at all, so I don't think I could
> > > have rejected optimizations for it.
> >
> > Then how am I using it? :)
> > https://github.com/s3ql/s3ql/pull/116
>
> I stand corrected. I guess I didn't do a release since then, so the
> documentation isn't updated yet. Apologies.
>
> That said, my point about the optimization stands: I do not remember
> rejecting anything here. Do you have a link for that too? :-)

I wasn’t entirely correct either: you didn’t strictly reject the feature, but rather scared the contributor into not doing it :)


It would appear that’s not a theoretical issue — I’m hitting the b2_get_file_versions cap from time to time.

Ivan Shapovalov

Jul 10, 2020, 5:29:21 PM
to Nikolaus Rath, s3...@googlegroups.com
Hmm, I'm not sure how that's supposed to work.

AFAICS, s3ql uses "solid compression", meaning that the entire block is
compressed at once. It is generally impossible to extract a specific
range of uncompressed data without decompressing the whole stream.[1]

Encryption does not pose this kind of existential problem — AES is used
in CTR mode, which theoretically permits random-access decryption — but
the crypto library in use, python-cryptography, doesn't seem to permit
this sort of trickery.

[1]: This can be solved by converting the compression layer into a
block-based one, but this will naturally break compatibility (i. e. we
will have to introduce a new set of compression algorithms, that is,
another corner case) and will require us to either compromise on the
block size, or introduce complex indirection (such as storing
compressed/uncompressed offset maps along with the object itself), or
completely blow the metadata out of proportion (recording an offset
mapping for every 128 KiB of data). Regardless of the implementation
plan, this will also compromise compression efficiency. Completely not
worth it, IMO.

Nikolaus Rath

Jul 11, 2020, 7:13:58 AM
to s3...@googlegroups.com
At least bzip2 always works in blocks, IIRC blocks are at most 900 kB
(for highest compression settings). I wouldn't be surprised if the same
holds for LZMA.

We could track the size of each compressed block, and store it as part
of the metadata of the object (so it doesn't blow up the SQLite table).

> Encryption does not pose this kind of existential problem — AES is used
> in CTR mode, which theoretically permits random-access decryption — but
> the crypto library in use, python-cryptography, doesn't seem to permit
> this sort of trickery.

Worst case you can feed X bytes of garbage into the decrypter and then
start with the partial block - with CTR you should get the right
output.
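
For what it's worth, here is a minimal sketch of that trick with
python-cryptography (my own illustration, not S3QL code; the nonce and
counter handling would have to match whatever the real on-disk format
uses):

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def decrypt_from_offset(key, nonce, ciphertext_part, offset):
        """Decrypt a fragment that starts `offset` bytes into the stream.

        In CTR mode the keystream for byte N depends only on N, so running
        the decryptor over `offset` bytes of zeros (and discarding the
        output) advances it to the right position before the real data.
        """
        decryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
        decryptor.update(b"\x00" * offset)        # burn keystream up to offset
        return decryptor.update(ciphertext_part)

In practice one would feed the filler in chunks instead of allocating
`offset` bytes at once, but the principle is the same.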

Best,

Ivan Shapovalov

Jul 11, 2020, 12:22:19 PM
to Nikolaus Rath, s3...@googlegroups.com
True, I forgot that bzip2 is inherently block-based. Not sure about
LZMA or gzip, but there is still a significant obstacle: how would you
extract this information from the compression libraries?

>
> We could track the size of each compressed block, and store it as part
> of the metadata of the object (so it doesn't blow up the SQLite table).
>
> > Encryption does not pose this kind of existential problem — AES is
> > used in CTR mode, which theoretically permits random-access
> > decryption — but the crypto library in use, python-cryptography,
> > doesn't seem to permit this sort of trickery.
>
> Worst case you can feed X bytes of garbage into the decrypter and then
> start with the partial block - with CTR you should get the right
> output.

Yes, that could probably work. Still feels like a grand hack.

Nikolaus Rath

Jul 15, 2020, 6:12:27 AM
to s3...@googlegroups.com
No need to extract it: S3QL hands data to the compression library in
smaller chunks (IIRC 128 kB), so we just have to keep track of what goes
into and comes out of the compression library.
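
As a rough illustration of that bookkeeping (an assumed helper, not
S3QL's actual compression path), one could record where the compressed
output stood whenever an input chunk went in; making each chunk
independently decodable would additionally require explicit block/flush
boundaries:

    import lzma

    CHUNK = 128 * 1024  # size of the pieces handed to the compressor

    def compress_with_offset_map(reader, writer):
        """Compress a stream chunk by chunk and record an offset map.

        Each entry is (uncompressed_offset, compressed_offset): where the
        compressed output stood when that input chunk was fed in. The map
        could then be stored as per-object metadata for partial retrieval.
        """
        comp = lzma.LZMACompressor()
        offset_map = []
        in_off = out_off = 0
        while True:
            chunk = reader.read(CHUNK)
            if not chunk:
                break
            offset_map.append((in_off, out_off))
            produced = comp.compress(chunk)   # may return b'' while buffering
            writer.write(produced)
            in_off += len(chunk)
            out_off += len(produced)
        writer.write(comp.flush())
        return offset_map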


Best,
-Nikolaus