Canonical form of schema blobs?


Ian Denhardt

unread,
Jun 11, 2019, 4:41:07 PM
to per...@googlegroups.com
Hey all,

I've got a large amount of data that I'm trying to import into perkeep,
from an old home-grown backup system that used hard links to
de-duplicate files based on hashes. As a consequence, pk put is moving
much more slowly than I think is achievable here, since it doesn't know
about the dedup scheme and is scanning each file every time through each
backup. Doing some back-of-the-napkin calculations, at the rate it's
going it will take far too long to be realistic. It got through half of
the data (by disk usage) pretty quickly, and then slowed to a crawl.

I think I can speed things up dramatically by writing a custom tool that
looks at inodes to determine if a file is already present, but I have a
concern: JSON doesn't have a canonical form, so I worry if I do this
naively perkeep will fail to use the same bit-for-bit representations
for all of the various schema blobs (varying in whitespace for example),
thus unnecessarily duplicating content.

My question is: what would I need to do to make sure that doesn't
happen, i.e. that my one-off tool ends up picking the same formatting
and such as pk put?

Thanks,

-Ian

ta...@gulacsi.eu

unread,
Jun 12, 2019, 6:16:03 AM
to per...@googlegroups.com
I'd turn it around: make your one-off program call pk-put only on "genuine" files, store the hash, and pk-put a link for the copies.




-- 
You received this message because you are subscribed to the Google Groups "Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email to perkeep+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/156028539437.1203.10628641923295501163%40localhost.localdomain.
For more options, visit https://groups.google.com/d/optout.

Ian Denhardt

unread,
Jun 12, 2019, 12:25:05 PM
to per...@googlegroups.com, ta...@gulacsi.eu
> pk-put a link on copies.

Can you elaborate on what you mean by this?

Tamás Gulácsi

unread,
Jun 13, 2019, 2:29:23 AM
to per...@googlegroups.com
On Wed, Jun 12, 2019 at 12:20:36PM -0400, Ian Denhardt wrote:
> > pk-put a link on copies.
>
> Can you elaborate on what you mean by this?
>

A permanode is

{
  "camliVersion": 1,
  "camliSigner": "sha224-5c1c77bc4df9eb027f7804b678f77e771a7a69989e626954480d272a",
  "camliType": "permanode",
  "claimDate": "2019-06-02T17:09:11.506488935Z",
  "random": "Sc08fLUnCvyAuH/s98Jnzrgs9aU=",
  "camliSig": "wsBcBAABCAAQBQJc9AK3CRAdaMP0l06jiwAAL0kIALriBpbm/NYjHtC99wpyn6+BCmbCNaOoOY3JEwpxlPnSeAp5MVX6nHWRKZFHFpEEZ/MWgS73+b7JdUTOoCQ5Rpaxu4JE1TIQtV67Ye7Q6M66g0pB5aImknMmR78Sy3sypyBZ/TQ6wy32eGRijftjpdjAGHghX4M3TbBGYZVzPHaeeh1VFcRiC9LiMF3FARce2E5YsHwZG8C6tIpvuji7AxqRhDbkT/IINC6UDFowM6WdOiNgYNDjyGSyhH3OoDH1Z3zV9svzmdv8sQpB7yTpk2VlTZcIX9g/+GEhusT8p+4gBk9bFtJ3T6dsCVIi2bdWpJbsWb3KJh88kg1/c0uLR2c==CQ29"
}

Its camliContent attribute might be sha224-4b36a6dd464e39a63bceebbc07ff094dff5b3c3b24d2ef20f632afba,
which is a file:

{
  "camliVersion": 1,
  "camliType": "file",
  "fileName": "IMG_20190602_185859_HDR.jpg",
  "parts": [
    {
      "blobRef": "sha224-c3b02935185579ff8d1eea773f33af33b89c21cf1993c4ab942a1bd0",
      "size": 262144
    },
    {
      "bytesRef": "sha224-882cc1b2a5d2139eb5b283e476ffc5929fd9a076f2bbff746e3efa43",
      "size": 683332
    },
    {
      "blobRef": "sha224-00413383533f266768d245f0eacb01efdc76ec0b7211ac182f4bb600",
      "size": 67649
    },
    {
      "blobRef": "sha224-385442355bc9c74935df5e9902e840dc23e78b2d9dc33f2f5f54c45d",
      "size": 68654
    },
    {
      "blobRef": "sha224-c60c0c7917b84edb8e3812ef4d9879d1074930e208a8e23fe26d5756",
      "size": 80535
    },
    {
      "blobRef": "sha224-e89465cef632f50294f1ba3b5fbcd83d18fb50161279149da8342403",
      "size": 70209
    },
    {
      "blobRef": "sha224-08afe52d1f5258cf2bd0b75f63f3f0e4aa45f04f0160f4349c777fce",
      "size": 43070
    }
  ]
}

A file is just that: it has a version, a type, a fileName, and parts.


You can either duplicate the file by storing a second file blob with a
different fileName, or create another permanode with the same
camliContent ("pk-put permanode", then "pk-put attr <perma> camliContent
<content>").

Tamás Gulácsi

Ian Denhardt

unread,
Jun 13, 2019, 1:46:00 PM
to Tamás Gulácsi, per...@googlegroups.com
Quoting Tamás Gulácsi (2019-06-13 02:29:16)
> Either you can duplicate the file by duplicating a file blob with
> different fileName

This was more or less my intent, but my question was more about how to
avoid formatting the "spine" of the filesystem in such a way that I get
a separate copy of all the file and directory blobs (but not the parts)
if pk put goes over it again by itself, rather than with my own tool.
In particular, at least these things can make two semantically identical
json blobs have different blobrefs:

* Differing whitespace/indentation
* Differing order of object fields

The output from perkeep looks like it was formatted with an
"encoding/json".Encoder configured with .SetIndent("", " ").
If I declare structs with fields in the same order as they appear in
the schema blobs, will that plus the .SetIndent call be sufficient to
ensure that equivalent blobs get the same hash? (Are these structs
declared publicly somewhere? I feel like they should be, but I'm having
trouble finding them in the API docs.)

-Ian