Advice on adding full text search/review latency

Ian Denhardt

Dec 3, 2019, 12:12:31 AM

to per...@googlegroups.com

Hey all,

Two separate things I wanted to poke my head in and ask about.

First, I've noticed that shortly after the switch to GitHub pull
requests, the point of which was to lower the barrier to entry for
contributors, patches sort of stopped getting reviewed. I myself have 4
of them that have been open since the summer and haven't received any
comments[1]. I kinda wandered off after that, but Bob Glickstein's
recent work[2] (which has also not seen any review, except for my own
comment) got me thinking about Perkeep development again.

I know Brad has very little time, and not being paid anymore means
Mathieu has limited time to devote as well, and I don't want to demand
anything of anyone. I guess I'm just wanting to know what to expect
here?

---

Second, when I started working on Perkeep I wanted to implement full
text search. I still do, and I'm finding that without this my use of
Perkeep is much more limited than it might otherwise be, so I'm
increasingly itching to get back to this. But I have some uncertainties
regarding how to proceed, even assuming the review issue can be
solved.

The basic issue I'm hitting is: I'm having a really hard time figuring
out how to modify the indexer to support this. Attaching the Bleve
index itself isn't hard, but there's very little documentation on the
format of the index, and after a lot of staring at the implementations
of index, corpus, the search handler, and everything that touches those,
I still couldn't figure out how to go about integrating significant new
functionality like this.

I eventually concluded I could get a better sense of how this might work
by prototyping a new indexer implementation entirely. I got it to the
point where it indexes and does full text search for plain text files
and PDFs, but doesn't support any of the existing search predicates. The
WIP is here[4]. There are a few lingering design questions I have, but
the big thing is: I still don't know how to reconcile this with the main
indexer, I don't feel gung ho about suggesting a complete rewrite of the
indexer, and I don't feel like I even have enough of a clue about how to
extend the indexer to even be able to ask *specific* questions about how
it works; I don't know what to ask that could be answered more concisely
than someone writing a big overview doc describing at a high level the
format of the data in the index, how searches are actually executed,
etc. If it came down to it, I could just keep developing the separate
indexer implementation, and if Perkeep were abandoned and I was totally
on my own I might just go that route, as I've basically given up on
understanding how to work with the existing indexer by myself.

...I hate not having a more specific question than "any advice?" but I'm
coming up short here and I'd still like to make this happen if I can.
Any advice?

-Ian

[1]: https://github.com/perkeep/perkeep/pulls/zenhack
[2]: https://github.com/perkeep/perkeep/pull/1282
[3]: https://github.com/perkeep/perkeep/issues/580
[4]: https://github.com/zenhack/perkeep/tree/fulltext-index

Mathieu Lonjaret

Dec 3, 2019, 4:03:57 AM

to per...@googlegroups.com

Hello Ian,

First off, and in case I fail to answer later, I want to say I am
sorry that I have not committed more to reviewing your work. I know
how it feels to be in your shoes.

And maybe one piece of advice (and I speak only for myself here), in
the short term at least, would be to keep on collaborating with people
interested in the same topic, like you're doing with Bob at the
moment. I feel that might be the most productive way to go for now.

Regards,
Mathieu

> --
> You received this message because you are subscribed to the Google Groups "Perkeep" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to perkeep+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/157534994703.15280.11560776550978742185%40localhost.localdomain.

Bob Glickstein

Dec 3, 2019, 10:48:16 AM

to per...@googlegroups.com

Thanks Ian for bringing this up, and Mathieu for your response. I understand having limited bandwidth only too well.

While I'm learning to hack on Perkeep I'm happy to do it entirely in my fork of the repo and wait either for the main Perkeep project to kick back into gear or for a critical mass to coalesce around one fork or another. Of course I would be grateful for any code review, particularly since it would help bring me up to speed more quickly, but there's no urgency there.

For now I'm actually more interested to understand some of Perkeep's design decisions and whether they can be revisited and reconsidered.

For example, the design seems to want claims to modify permanodes, but some of the implementation seems not to care whether the target of a claim is a permanode or not. Can that constraint be explicitly removed?

For another example, claims and permanodes seem to need timestamps, but I quickly ran into cases where a timestamp is not desirable and ended up using time.Unix(0, 0) as a handy null value. I see that the core Perkeep code does that in one or two places too. How about making timestamps optional?

These examples are of interest in part because (as you may have seen) I have been playing around with importing highly structured static data as schema blobs. To that end I'd also like to explore design options that would allow schema blobs (rather than merely claims and permanodes) to protect one another from garbage collection.

(How might that look? Imagine a "schema+" blob with this structure: {camliVersion:..., camliType:..., payload:..., protects:[...]}. Here payload is whatever structure camliType says it is, and protects is a list of blobrefs reachable from this blob.)
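
In Go, that proposed shape might be modeled roughly as follows. This is only a sketch of the idea in the paragraph above: the `SchemaPlus` type, the `protectedRefs` helper, and the placeholder blobrefs are all hypothetical, and no such blob type exists in Perkeep.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchemaPlus mirrors the proposed "schema+" blob. It is hypothetical
// and not part of Perkeep's schema.
type SchemaPlus struct {
	CamliVersion int             `json:"camliVersion"`
	CamliType    string          `json:"camliType"`
	Payload      json.RawMessage `json:"payload"`  // left undecoded; shape depends on CamliType
	Protects     []string        `json:"protects"` // blobrefs reachable from this blob
}

// protectedRefs decodes a blob and returns the refs it protects,
// without needing to understand the payload.
func protectedRefs(raw []byte) ([]string, error) {
	var b SchemaPlus
	if err := json.Unmarshal(raw, &b); err != nil {
		return nil, err
	}
	return b.Protects, nil
}

func main() {
	// Placeholder values, not real blobrefs.
	raw := []byte(`{"camliVersion":1,"camliType":"example",` +
		`"payload":{"subject":"hi"},"protects":["sha224-aaaa","sha224-bbbb"]}`)
	refs, err := protectedRefs(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(refs)
}
```

Keeping `payload` as raw JSON means a generic tool could read `protects` for any `camliType` without knowing that type's structure.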

Cheers,

- Bob


Mathieu Lonjaret

Dec 3, 2019, 11:19:55 AM

to per...@googlegroups.com

On Tue, 3 Dec 2019 at 16:48, Bob Glickstein <bob.gli...@gmail.com> wrote:
>
> Thanks Ian for bringing this up, and Mathieu for your response. I understand having limited bandwidth only too well.
>
> While I'm learning to hack on Perkeep I'm happy to do it entirely in my fork of the repo and wait either for the main Perkeep project to kick back into gear or for a critical mass to coalesce around one fork or another. Of course I would be grateful for any code review, particularly since it would help bring me up to speed more quickly, but there's no urgency there.
>
> For now I'm actually more interested to understand some of Perkeep's design decisions and whether they can be revisited and reconsidered.
>
> For example, the design seems to want claims to modify permanodes, but some of the implementation seems not to care whether the target of a claim is a permanode or not. Can that constraint be explicitly removed?

Given that a blob is immutable, I don't think it would make any sense
for the target to be anything other than a permanode. What are you
trying to do?

> For another example, claims and permanodes seem to need timestamps, but I quickly ran into cases where a timestamp is not desirable and ended up using time.Unix(0, 0) as a handy null value. I see that the core Perkeep code does that in one or two places too. How about making timestamps optional?

AFAIR, you need a timestamp on a permanode or a claim because
you need a date when you sign them. So, I'd have to look in
the code, but I think the places where you saw a zero time being used
were just cases where we defer the decision of which time to use. When
the claim or permanode actually gets signed, we use something like
the time that was provided, or some other time found in e.g. the
related contents. So again, as a concept, I don't think a timestamp
can be optional for a permanode or a claim.


Ian Denhardt

Dec 3, 2019, 12:13:21 PM

to Bob Glickstein, per...@googlegroups.com

Quoting Bob Glickstein (2019-12-03 10:48:01)

> For another example, claims and permanodes seem to need timestamps, but
> I quickly ran into cases where a timestamp is not desirable and ended
> up using time.Unix(0, 0) as a handy null value.

Can you clarify what the problem is?

The timestamp on a claim indicates when the claim was made, so the
correct value when creating one is typically "now."

> I see that the core Perkeep code does that in one or two places too.

I just did a grep of the source tree for 'time.Unix(0, *0)' and only
came up with stuff in vendor/ and two places where it was being used for
the SignatureTime on a permanode, but not claimDate on either permanode
or claims. Not sure what you're referring to?

> How about making timestamps optional?

For claims, this is problematic because searching permanodes as of a
given time depends on the claim dates; it works by just ignoring all
of the claims after that date:

https://perkeep.org/pkg/search/#PermanodeConstraint

For permanodes, the docs don't suggest that these are required?

> These examples are of interest in part because (as you may have seen) I
> have been playing around with importing highly structured static data
> as schema blobs. To that end I'd also like to explore design options
> that would allow schema blobs (rather than merely claims and
> permanodes) to protect one another from garbage collection.
> (How might that look? Imagine a "schema+" blob with this structure:
> {camliVersion:..., camliType:..., payload:..., protects:[...]}. Here
> payload is whatever structure camliType says it is, and protects is a
> list of blobrefs reachable from this blob.)

You may be interested in:

https://perkeep.org/doc/schema/keep

Ian Denhardt

Dec 3, 2019, 12:22:28 PM

to Mathieu Lonjaret, per...@googlegroups.com

Thanks for your reply. I may just plug away at the new indexer on my own
for now.

-Ian


Bob Glickstein

Dec 5, 2019, 5:36:21 PM

to Ian Denhardt, per...@googlegroups.com

On Tue, Dec 3, 2019 at 9:13 AM Ian Denhardt <i...@zenhack.net> wrote:

> For permanodes, the docs don't suggest that [timestamps] are required?

Sorry to be unclear; I was speaking about making timestamps optional for permanodes only, not claims. My belief that they are required comes from the API for schema.Builder. For something like my pkmail tool, I have a permanode that all e-mail messages ever imported become "camliMembers" of. I need that permanode to be predictable. If I create a Builder with NewPlannedPermanode, my only options are to get an unsigned blob (with Builder.Blob) or a blob signed at "now" (with Builder.Sign) or a blob signed at a given time (with Builder.SignAt).
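
The predictability concern follows from content addressing: a blobref is a hash of the blob's bytes, so any field that varies between runs, such as a signing time of "now", yields a different ref. A minimal sketch of that effect, assuming SHA-224 refs and using a made-up JSON layout rather than Perkeep's real signed-permanode format (`blobRef` and `permanodeJSON` are hypothetical helpers):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// blobRef is a hypothetical helper that names a blob by the SHA-224
// hash of its bytes, in the style "sha224-<hex>".
func blobRef(blob []byte) string {
	return fmt.Sprintf("sha224-%x", sha256.Sum224(blob))
}

// permanodeJSON builds an illustrative (not Perkeep-accurate) blob
// whose only varying field is the timestamp.
func permanodeJSON(key, signedAt string) []byte {
	return []byte(fmt.Sprintf(
		`{"camliType":"permanode","key":%q,"signedAt":%q}`, key, signedAt))
}

func main() {
	fixed := "1970-01-01T00:00:00Z"
	a := blobRef(permanodeJSON("pkmail-root", fixed))
	b := blobRef(permanodeJSON("pkmail-root", fixed))
	c := blobRef(permanodeJSON("pkmail-root", "2019-12-05T17:36:21Z"))
	fmt.Println(a == b) // true: same bytes, same ref
	fmt.Println(a == c) // false: different timestamp, different ref
}
```

With a fixed time, every run reproduces the same ref; with "now", each run creates a new permanode.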

> You may be interested in:
>
> https://perkeep.org/doc/schema/keep

I'm interested in making blobs reachable (and indexable too) without needing separate claims. If I have a blob X that's a highly structured JSON object, where some of the JSON values are blobrefs, I'd like for Perkeep to consider the referenced blobs reachable if X is reachable regardless of any claims connecting X to other blobs. Fewer total blobs are needed this way, as long as no mutability is required, and complex trees of objects (like the headers, bodies, and attachments of a folder full of e-mail messages) can be reconstructed more easily and efficiently.

But maybe that's just not the Perkeep Way...

Cheers,

- Bob

Ian Denhardt

Dec 6, 2019, 9:33:24 PM

to Bob Glickstein, bo...@emphatic.com, per...@googlegroups.com

Quoting Bob Glickstein (2019-12-05 17:36:08)

> I'm interested in making blobs reachable (and indexable too) without
> needing separate claims. If I have a blob X that's a highly structured
> JSON object, where some of the JSON values are blobrefs, I'd like for
> Perkeep to consider the referenced blobs reachable if X is reachable
> regardless of any claims connecting X to other blobs. Fewer total blobs
> are needed this way, as long as no mutability is required, and complex
> trees of objects (like the headers, bodies, and attachments of a folder
> full of e-mail messages) can be reconstructed more easily and
> efficiently.

My assumption has always been that the GC algorithm would follow
references in schema blobs, so you would only need to point a claim at
the "root", not every blob in the tree. Of course, GC isn't implemented
at all currently, so all of this is hypothetical.

-Ian
