Mistaking JSON for Metadata, and Deduplication Question.

Ralph Corderoy

Nov 10, 2018, 10:55:04 AM
to per...@googlegroups.com
Hi,

I've just watched Brad and Mathieu's LinuxFest Northwest 2018 talk on
Perkeep, https://youtu.be/PlAU_da_U4s and have a couple of questions.

Say a `pk get $hash1' shows some Perkeep metadata as JSON. If I were to
`pk put' some text that was valid JSON Perkeep metadata then I assume
Perkeep initially treats it as if it were genuine when re-building the
index from just the blobs. Can problems be caused by it being faulty
metadata, e.g. an incorrect schema, or referring to blobs that don't
exist? If not, because those problems are ignored on the assumption it
wasn't real Perkeep-authored metadata after all, that would mean genuine
problems, e.g. caused by a bug, might go undetected at this stage?

I understand the rolling-checksum deduplication that Perkeep already
does. Are the resulting 0-16 MiB blobs ever compressed when stored?
Has any thought been given to deduplication at other granularities?
Given,

foo.png
bar.pdf has foo.png within it
xyzzy.mbox has a base64'd bar.pdf within it
xyzzy.mbox.gz is exactly a gzip'd xyzzy.mbox
all.tar has all the above

it's conceivable that some background process can continually look over
the blobs for dedupe opportunities. Is this something that could fit in
with Perkeep's model, or does the default lack of blob deletion (for
good reasons) get in the way?
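
For readers unfamiliar with the rolling-checksum approach mentioned above, here is a minimal sketch of content-defined chunking. It is not Perkeep's implementation: the window size, hash, cut rule, and the `chunk` and `put_all` helpers are all assumptions chosen for illustration, and the minimum and maximum chunk sizes a real system would enforce are left out. It only shows why inserting bytes near the start of a file leaves most later chunks, and so most blobs, unchanged.

```python
import hashlib
import random

WINDOW = 16                        # bytes of context that decide a boundary (assumed)
CUT_BITS = 6                       # cut when the top 6 hash bits are all ones
BASE = 257
MOD = 1 << 32
BASE_POW = pow(BASE, WINDOW, MOD)  # weight of the byte that leaves the window


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            # Drop the contribution of the byte that just left the window.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        window_full = i >= WINDOW - 1
        if window_full and (h >> (32 - CUT_BITS)) == (1 << CUT_BITS) - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def put_all(store: dict[str, bytes], data: bytes) -> list[str]:
    """Store each chunk under its content hash; identical chunks are stored once."""
    refs = []
    for c in chunk(data):
        # The "sha224-" naming is an assumption modelled on Perkeep-style blobrefs.
        ref = "sha224-" + hashlib.sha224(c).hexdigest()
        store.setdefault(ref, c)
        refs.append(ref)
    return refs


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(20_000))
    store: dict[str, bytes] = {}
    put_all(store, original)
    before = len(store)
    put_all(store, b"a few inserted bytes" + original)
    print(f"{before} blobs for the original, {len(store) - before} new after the edit")
```

Because a boundary depends only on the bytes inside the window, this kind of chunking cannot see that a base64'd or gzip'd copy holds the same content: the byte sequences differ, so the chunks do too.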

Lastly, https://perkeep.org could benefit from having an up-to-date
`here's some of the things you could use it for' on the front page.
I can find https://perkeep.org/doc/uses but it's probably out of date
and doesn't touch on the tantalisingly interesting answers in the Q&A at
the end of the presentation. Read from three PKs, write to the one with
space, etc. An up-to-date list of importers would be good too as many
might arrive with a social-media site in mind, e.g. Google+ given its
declared demise. Why I'd be interested in using it could be better sold
on the first page I reach.

--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Mathieu Lonjaret

Nov 10, 2018, 12:52:30 PM
to per...@googlegroups.com
On Sat, 10 Nov 2018 at 16:55, Ralph Corderoy <ra...@inputplus.co.uk> wrote:
>
> Hi,

Hi,

> I've just watched Brad and Mathieu's LinuxFest Northwest 2018 talk on
> Perkeep, https://youtu.be/PlAU_da_U4s and have a couple of questions.
>
> Say a `pk get $hash1' shows some Perkeep metadata as JSON. If I were to
> `pk put' some text that was valid JSON Perkeep metadata then I assume
> Perkeep initially treats it as if it were genuine when re-building the
> index from just the blobs. Can problems be caused by it being faulty
> metadata, e.g. an incorrect schema, or referring to blobs that don't
> exist? If not, because those problems are ignored on the assumption it
> wasn't real Perkeep-authored metadata after all, that would mean genuine
> problems, e.g. caused by a bug, might go undetected at this stage?

AFAIR, when the index receives a blob, it checks whether it is a valid
claim. If not, the blob is simply ignored. Then, to some extent, it also
checks whether the mutation introduced by the claim makes sense. If
not, the claim is ignored.
Does that answer your question?
If not, could you propose a concrete example to demonstrate?
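
The two-stage filtering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the indexer's actual code: the `classify` function and its return labels are hypothetical, the field names (`camliVersion`, `camliType`, `camliSigner`, `claimDate`, `claimType`, `permaNode`) are recalled from the Perkeep schema and should be checked against its documentation, and signature verification is omitted entirely.

```python
import json

# Fields this sketch assumes every claim must carry (an assumption, see above).
REQUIRED_CLAIM_FIELDS = {"camliSigner", "claimDate", "claimType", "permaNode"}


def classify(blob: bytes, known_refs: set[str]) -> str:
    """Decide what a hypothetical indexer would do with one blob."""
    try:
        obj = json.loads(blob)
    except ValueError:
        # Not JSON at all (covers decode errors too): plain data.
        return "data"
    if not isinstance(obj, dict) or obj.get("camliVersion") != 1:
        # Valid JSON, but not shaped like a schema blob: still plain data.
        return "data"
    if obj.get("camliType") != "claim":
        return "other-schema"
    if not REQUIRED_CLAIM_FIELDS <= obj.keys():
        # Looks like a claim but is malformed: ignored, not an error.
        return "ignored"
    if obj["permaNode"] not in known_refs:
        # The mutation targets something unknown, so it makes no sense here.
        return "ignored"
    return "indexed"
```

In this sketch a hand-written blob that merely resembles metadata and a genuinely buggy claim both end up as "ignored", which is the ambiguity the question is about.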

> I understand the rolling-checksum deduplication that Perkeep already
> does. Are the resulting 0-16 MiB blobs ever compressed when stored?

Not that I know of. Well it all depends on what kind of blobserver
implementation you use. For example, the blobpacked implementation
stores blobs pretty much like in a zip file. So I can imagine
compression could be enabled for these.
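
To make the zip analogy concrete, here is a small sketch using Python's standard `zipfile` module. It does not reproduce blobpacked's real on-disk layout or configuration; the `pack` and `unpack` helpers and the blob names are hypothetical. It only shows that a zip-like container can store members either as-is or deflated, which is the switch being imagined above.

```python
import io
import zipfile


def pack(blobs: dict[str, bytes], compress: bool) -> bytes:
    """Write blobs into an in-memory zip, stored verbatim or deflated."""
    method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        for ref, data in blobs.items():
            zf.writestr(ref, data)
    return buf.getvalue()


def unpack(packed: bytes) -> dict[str, bytes]:
    """Read every member back, keyed by its name."""
    with zipfile.ZipFile(io.BytesIO(packed)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


if __name__ == "__main__":
    blobs = {"blob-1": b"perkeep " * 2000, "blob-2": b"0123456789" * 500}
    stored, deflated = pack(blobs, compress=False), pack(blobs, compress=True)
    print(len(stored), "bytes stored,", len(deflated), "bytes deflated")
```

Already-compressed content such as PNG or gzip data would gain little from this, so whether it is worthwhile depends on what the blobs hold.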

> Has any thought been given to deduplication at other granularities?
> Given,
>
> foo.png
> bar.pdf has foo.png within it
> xyzzy.mbox has a base64'd bar.pdf within it
> xyzzy.mbox.gz is exactly a gzip'd xyzzy.mbox
> all.tar has all the above
>
> it's conceivable that some background process can continually look over
> the blobs for dedupe opportunities. Is this something that could fit in
> with Perkeep's model, or does the default lack of blob deletion (for
> good reasons) get in the way?

I don't know.

> Lastly, https://perkeep.org could benefit from having an up-to-date
> `here's some of the things you could use it for' on the front page.
> I can find https://perkeep.org/doc/uses but it's probably out of date
> and doesn't touch on the tantalisingly interesting answers in the Q&A at
> the end of the presentation. Read from three PKs, write to the one with
> space, etc. An up-to-date list of importers would be good too as many
> might arrive with a social-media site in mind, e.g. Google+ given its
> declared demise. Why I'd be interested in using it could be better sold
> on the first page I reach.

Sure. Contributions are welcome. :-)

> --
> Cheers, Ralph.
> https://plus.google.com/+RalphCorderoy

Ralph Corderoy

Nov 13, 2018, 8:40:10 AM
to per...@googlegroups.com
Hi Mathieu,

> > If not, because those problems are ignored on the assumption it
> > wasn't real Perkeep-authored metadata after all, that would mean
> > genuine problems, e.g. caused by a bug, might go undetected at this
> > stage?
>
> AFAIR, when the index receives a blob, it checks whether it is a valid
> claim. If not, it is simply ignored. Then, in some measure, it is also
> checked whether the mutation introduced by the claim makes sense. If
> not, it is ignored. Does that answer your question?

Yes thanks. I wanted to make sure I hadn't missed an alternative to the
two choices I could see given blobs have no `internal metadata' bit.

> For example, the blobpacked implementation stores blobs pretty much
> like in a zip file. So I can imagine compression could be enabled for
> these.
...
> > Lastly, https://perkeep.org could benefit from having an up-to-date
> > `here's some of the things you could use it for' on the front page.
> > I can find https://perkeep.org/doc/uses but it's probably out of
> > date and doesn't touch on the tantalisingly interesting answers in
> > the Q&A at the end of the presentation.
>
> Sure. Contributions are welcome. :-)

:-) But that's just it: I, and I suspect a lot of other passers-by,
don't know what's possible and what's missing. Take the `blobpacked' you
mention above. Google's first page of results for `perkeep blobpacked'
doesn't describe what it is, unless you count the package documentation at
https://perkeep.org/pkg/blobserver/blobpacked/

A list of use cases, importers, and backends, with a sentence briefly
describing each would give a much better first-glance overview of
whether a visitor should stick around. Even links to
https://perkeep.org/pkg/blobserver/#pkg-subdirectories and
https://perkeep.org/pkg/importer/#pkg-subdirectories would help, but
they don't cover the use cases, e.g. small laptop as a local cache.

It seems Perkeep deserves to be better known and more widely used
amongst programmers, but the website is probably losing a lot of them at
the `top of the funnel'.

Simon B.

Nov 14, 2018, 4:15:18 PM
to per...@googlegroups.com
@Ralph I'm interested in improving the documentation too. I've followed this project for years, but didn't manage to drum up support for putting more newbie-targeted material front and center on the web pages, and I don't yet understand enough to contribute much to the documentation that's (I guess?) generated from the source code. Possibly related: my own Perkeep usage is also very limited.

I plan to create a hand-written JavaScript configuration if I find time to understand how the config format works. There was also talk about changing the config file format from .js(on) to .toml or another nicer option.

How the config files can be used is crucial to document, but before that, I agree we need a list of use-cases and some hint on how well-tested and supported various uses are.

Maybe we could start a wiki to hash out some ideas and then make a joint PR?

As I understood, you'd like to document / list:
- use cases
- importers
- backends


> It seems Perkeep deserves to be better known and more widely used
> amongst programmers, but the website is probably losing a lot of them at
> the `top of the funnel'.

Yes. Let's fix this :)