fsbacked.Storage


Bob Glickstein

Nov 27, 2019, 5:57:55 PM
to per...@googlegroups.com

I've just opened a WIP PR against Perkeep in the hope of motivating a design discussion (about local blob storage, max blob sizes, and perhaps other topics). Here's the PR description. Cheers!


This PR is a preliminary sketch for a new blobserver type that uses files uploaded to it as their own storage.

When you add a file to an fsbacked.Storage that's within the directory tree it controls, an entry is added to a database that maps between files and blobrefs; but the file's contents are not copied anywhere. When fetching the file's content blob later, the database directs the Storage to the right local file and the data is served from there.

Adding files outside the directory tree, or adding any other kind of blob, fails over to another blobserver nested inside the fsbacked.Storage.
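
To make the idea concrete, here is a minimal sketch of the lookup-and-fallback behavior described above. It is not the code from the PR: the `storage` type, its fields, and the `sha256-` ref format are hypothetical stand-ins, the "database" is an in-memory map, and the nested blobserver is another map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// storage is a hypothetical stand-in for fsbacked.Storage.
type storage struct {
	root   string            // directory tree the storage controls
	byRef  map[string]string // stand-in database: content ref -> file path
	nested map[string][]byte // stand-in for the nested blobserver
}

// refOf derives an illustrative content ref (not Perkeep's blobref format).
func refOf(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256-" + hex.EncodeToString(sum[:])
}

// inTree reports whether path lies inside the controlled directory tree.
func (s *storage) inTree(path string) bool {
	rel, err := filepath.Rel(s.root, path)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// addFile indexes an in-tree file without copying its bytes;
// anything else fails over to the nested store.
func (s *storage) addFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	ref := refOf(data)
	if s.inTree(path) {
		s.byRef[ref] = path
	} else {
		s.nested[ref] = data
	}
	return ref, nil
}

// fetch serves in-tree blobs straight from the original file.
func (s *storage) fetch(ref string) ([]byte, error) {
	if path, ok := s.byRef[ref]; ok {
		return os.ReadFile(path)
	}
	if data, ok := s.nested[ref]; ok {
		return data, nil
	}
	return nil, errors.New("blob not found: " + ref)
}

// demo adds one file inside the tree and one outside it, then
// reports how many blobs were indexed in place versus copied.
func demo() (indexed, copied int, err error) {
	root, err := os.MkdirTemp("", "tree")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)
	other, err := os.MkdirTemp("", "elsewhere")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(other)

	s := &storage{root: root, byRef: map[string]string{}, nested: map[string][]byte{}}
	paths := []string{filepath.Join(root, "video.bin"), filepath.Join(other, "note.txt")}
	for _, p := range paths {
		if err := os.WriteFile(p, []byte("contents of "+filepath.Base(p)), 0o600); err != nil {
			return 0, 0, err
		}
		ref, err := s.addFile(p)
		if err != nil {
			return 0, 0, err
		}
		if _, err := s.fetch(ref); err != nil {
			return 0, 0, err
		}
	}
	return len(s.byRef), len(s.nested), nil
}

func main() {
	indexed, copied, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("indexed in place: %d, copied to nested store: %d\n", indexed, copied)
}
```

Note that the sketch still reads the whole file once to compute the ref; it only avoids storing a second copy.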

This solves the problem of wanting to add a tree of large files (e.g., videos of my kids growing up) to a local Perkeep instance without storing all the data twice. This should be used only on directory trees whose files do not change, lest the blobrefs in the database become mismatched to their corresponding files.

A number of other changes throughout Perkeep would be needed to make this truly useful. The io.Reader presented to a blobserver's ReceiveBlob method is usually (always?) some wrapper object (like checkHashReader) that conceals the underlying *os.File, without which fsbacked.Storage cannot detect that a file within its tree is being uploaded. And in any case, Perkeep's limit on blob sizes is rather low for this purpose.
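
The wrapper problem can be shown in a few lines. This is only an illustration: `countingReader` is a hypothetical wrapper standing in for something like checkHashReader, and `underlyingFile` is a made-up helper, not a Perkeep function.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// countingReader is a hypothetical wrapper around another io.Reader.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

// underlyingFile reports the *os.File behind r, but only when r
// itself is an *os.File; a type assertion cannot see through wrappers.
func underlyingFile(r io.Reader) (*os.File, bool) {
	f, ok := r.(*os.File)
	return f, ok
}

// demo checks detection for a bare file and for the same file wrapped.
func demo() (direct, wrapped bool, err error) {
	f, err := os.CreateTemp("", "blob")
	if err != nil {
		return false, false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	_, direct = underlyingFile(f)
	_, wrapped = underlyingFile(&countingReader{r: f})
	return direct, wrapped, nil
}

func main() {
	direct, wrapped, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("direct:", direct, "wrapped:", wrapped) // direct: true wrapped: false
}
```

Once the file is wrapped, the receiving side sees only an io.Reader and has no path to look up in its tree.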

Presented for further discussion.

Joe Moore

Apr 2, 2020, 4:30:42 PM
to Perkeep
I'm new here, but interested.

So if your file is small enough that it doesn't trip the rolling-hash blob-splitting function, the reference will be to the file on disk.

Can you just tune the hash function so that it never sees the need to split your blob? I don't know if different storage backends would be able to have different splitting behaviors.

--Joe

Markus Peröbner

Apr 6, 2020, 9:26:39 AM
to per...@googlegroups.com
I guess the rolling hash produces small chunks intentionally. The Perkeep
source code mentions 16 MB as the maximum chunk size in some places.

Splitting the file is the intended behavior. It has some advantages
compared to just hashing a complete file. Probably the most important
one shows up when you upload, for example, multiple virtual machine
images that differ in only a few places. The rolling checksums should
make sure that most of the parts the files have in common are stored
only once on the underlying storage backend, which saves disk space.
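
As a rough illustration of why this works, here is a toy content-defined
chunker. It is not Perkeep's rolling checksum: the window size, the cut
rule, and the names split and demo are all assumptions made for this
sketch. Two "images" that differ by a small insertion end up sharing
most of their chunks.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	window = 16 // bytes covered by the rolling sum (toy value)
	mask   = 63 // cut when the low 6 bits are zero, so chunks average ~64 bytes
)

// split cuts data wherever the sum of the last `window` bytes has its
// low bits clear. Cut points depend only on nearby content, so they
// realign after an insertion or deletion.
func split(data []byte) [][]byte {
	var chunks [][]byte
	var sum uint32
	start := 0
	for i, b := range data {
		sum += uint32(b)
		if i >= window {
			sum -= uint32(data[i-window]) // drop the byte leaving the window
		}
		if sum&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// demo stores the chunks of two nearly identical inputs in a
// content-addressed map and reports total chunks versus unique chunks.
func demo() (total, stored int) {
	rng := rand.New(rand.NewSource(1))
	imageA := make([]byte, 8192)
	for i := range imageA {
		imageA[i] = byte(rng.Intn(256))
	}
	// imageB is imageA with a few bytes inserted in the middle.
	imageB := append([]byte{}, imageA[:4000]...)
	imageB = append(imageB, []byte("patched")...)
	imageB = append(imageB, imageA[4000:]...)

	store := map[[sha256.Size]byte]bool{}
	for _, img := range [][]byte{imageA, imageB} {
		for _, c := range split(img) {
			total++
			store[sha256.Sum256(c)] = true
		}
	}
	return total, len(store)
}

func main() {
	total, stored := demo()
	fmt.Printf("chunks seen: %d, chunks actually stored: %d\n", total, stored)
}
```

Hashing each file as a single blob would instead store both images in
full, since their hashes differ.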

Joe Moore

Apr 8, 2020, 4:16:35 PM
to Perkeep
On Monday, April 6, 2020 at 9:26:39 AM UTC-4, markus.peroebner wrote:
> I guess the rolling hash produces small chunks intentionally. The Perkeep
> source code mentions 16 MB as the maximum chunk size in some places.
>
> Splitting the file is the intended behavior. It has some advantages
> compared to just hashing a complete file.

Yes, this goal is clear from the Perkeep docs. Blobs are split up into chunks.

If we can tune when files get split (to optimize for different backend
performance, for example) and if that tuning is backend-specific, then...

Could the fsbacked backend use, say, a 64 TB split size, so that everything
smaller is stored as-is?

--Joe 