Kythe Index update process


程洋

Jun 4, 2021, 4:03:56 AM
to Kythe
Hi, I'm still working on indexing AOSP and have made some progress.
But now I have two questions, and I'm not sure what the best approach is.

1. As I mentioned in another thread, the AOSP project has 310k kzips in total, which take considerable time to process. If someone commits a change to my AOSP project and I need to update the index, how can I do it with the minimum of work? It's impossible to index the whole project every time someone commits a change.

2. My company produces many phone products, and each product uses its own branch of the AOSP project. Since most files are the same across branches, it's an obvious waste of time and space to store a serving table for each product. Is there any way to store different branches of the index?

Salvatore Guarnieri

Jun 4, 2021, 1:41:57 PM
to 程洋, Kythe
Those are two interesting questions: there are no prescribed solutions for them yet, but they are important real-world problems.

1) We believe that one solution to this might be to identify what portions of the graph are invalidated by a change and recompute that portion of the graph. However, like I said, we haven't implemented this and don't know how difficult the implementation would be.

2) This may be somewhat similar to 1) since you are trying to identify what parts of the graph are impacted by the changes in each branch but, again, we don't have any implementation to accomplish this.

We would gladly welcome any contributions to achieve these two goals.

--
You received this message because you are subscribed to the Google Groups "Kythe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kythe+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kythe/40bc1abb-6382-4481-b842-c9427dd12a91n%40googlegroups.com.

Shahms King

Jun 4, 2021, 1:46:44 PM
to 程洋, Kythe
On Fri, Jun 4, 2021 at 1:03 AM 程洋 <d171...@gmail.com> wrote:
Hi, I'm still working on indexing AOSP and have made some progress.
But now I have two questions, and I'm not sure what the best approach is.

1. As I mentioned in another thread, the AOSP project has 310k kzips in total, which take considerable time to process. If someone commits a change to my AOSP project and I need to update the index, how can I do it with the minimum of work? It's impossible to index the whole project every time someone commits a change.

It's not presently possible to incrementally update an index. While it's something we'd like, we have no plans at the moment to address it.

2. My company produces many phone products, and each product uses its own branch of the AOSP project. Since most files are the same across branches, it's an obvious waste of time and space to store a serving table for each product. Is there any way to store different branches of the index?

Sure, although the support for this can have some problems. The general solution is to give each branch a distinct corpus and index the changed files (and likely their dependents) under that corpus, using vnames.json to map unchanged files to the original corpus. There will be some gaps and oddities, however, if there isn't a clean division of files between the corpora.
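As a rough illustration of that mapping, here is a sketch of a vnames.json that sends files under a hypothetical branch directory to their own corpus, with everything else falling through to a shared corpus. The corpus names and the branch path pattern are made up for illustration; rules are tried in order and the first matching pattern wins.

```json
[
  {
    "pattern": "product-x-branch/(.*)",
    "vname": {
      "corpus": "aosp-product-x",
      "path": "@1@"
    }
  },
  {
    "pattern": "(.*)",
    "vname": {
      "corpus": "aosp-shared",
      "path": "@1@"
    }
  }
]
```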

--Shahms



--
:-P

程洋

Jun 7, 2021, 5:29:00 AM
to Kythe
1. Since the 310k kzips are named by hash, can I just save all the kzip filenames (which are hashes of the index materials), re-extract every time (it only takes about 30 minutes, I believe), find the new kzips, index only those, and append the result to the serving table?
2. I will take a look at vnames.json; I didn't notice it in the documentation.
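A minimal sketch of the bookkeeping behind that idea, assuming kzip filenames are content hashes so an unchanged compilation produces the same filename across extractions (the directory layout and seen-file name are hypothetical):

```python
import os


def new_kzips(extraction_dir, seen_file):
    """Return kzip names not seen in a previous run, then record them."""
    # Load the hashes recorded by the previous run (if any).
    try:
        with open(seen_file) as f:
            seen = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()

    current = {name for name in os.listdir(extraction_dir)
               if name.endswith(".kzip")}
    fresh = sorted(current - seen)

    # Record the full current set for the next run.
    with open(seen_file, "w") as f:
        for name in sorted(current):
            f.write(name + "\n")
    return fresh
```

Only the kzips returned here would need to be run through the indexer; whether appending their output to an existing serving table yields a coherent index is the open question discussed below.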

Shahms King

Jun 7, 2021, 10:39:34 AM
to 程洋, Kythe
You can try that, but I don't have any idea what the results would look like.

--Shahms

michael.j....@gmail.com

Jun 7, 2021, 11:32:31 AM
to Kythe
An implicit aspect of this problem is the difference between "indexing" and "serving". Specifically: The server expects the postprocessor to denormalize the graph to make reverse dependency relationships explicit. That turns out to be the key reason incremental index updating is tricky.

However, as the others have said, you can get some efficiencies: You have to run the build to see what has changed, but:

1. You can use the content address of the kzip to cache indexer output, so the indexer doesn't need to be re-run for every compilation record on every build.[see Caveat]
2. You can store indexer outputs in a more compact format, to (re)normalize away the repetition of the tuple format.
3. You can use content addresses to track which indexer outputs belong to each branch and/or version, to share common graph data.

This turns out to help quite a bit—but it doesn't completely solve the problem you've described: Simply "concatenating" new data to the serving tables isn't well-defined. That said: Depending on what queries you care about, that might or might not matter to you.
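The cache key from point 1, combined with the caveat about indexer versions, can be sketched as follows. The version string is a stand-in for however you actually identify your indexer builds (release tag, binary hash, etc.):

```python
import hashlib


def indexer_cache_key(kzip_bytes, indexer_version):
    """Cache key for indexer output.

    Combines the content address of the kzip with the indexer
    version, so a new indexer release (e.g. a bug fix) invalidates
    cached entries even when the compilation inputs are unchanged.
    """
    h = hashlib.sha256()
    h.update(indexer_version.encode("utf-8"))
    h.update(b"\x00")  # separator so version/content can't run together
    h.update(kzip_bytes)
    return h.hexdigest()
```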

–M

[Caveat]: The cache key for indexer outputs must include not only the kzip but also the indexer itself: A new indexer version (e.g., to fix a bug) may produce different output even if the inputs do not change.

peter.l...@gmail.com

Jun 7, 2021, 5:11:13 PM
to Kythe
On Monday, June 7, 2021 at 8:32:31 AM UTC-7 michael.j....@gmail.com wrote:
An implicit aspect of this problem is the difference between "indexing" and "serving". Specifically: The server expects the postprocessor to denormalize the graph to make reverse dependency relationships explicit. That turns out to be the key reason incremental index updating is tricky.

It's also possible to design a server that doesn't require denormalization, and therefore makes incremental indexing feasible.

I've written a prototype, using Prolog - nothing fancy; it just takes advantage of the ability to define "computed" facts, which are similar to "views" in SQL, but are more powerful. (The underlying concepts are Datalog, with examples being the Google projects Yedalog and Logica).

However, as the others have said, you can get some efficiencies: You have to run the build to see what has changed, but:

1. You can use the content address of the kzip to cache indexer output, so the indexer doesn't need to be re-run for every compilation record on every build.[see Caveat]
2. You can store indexer outputs in a more compact format, to (re)normalize away the repetition of the tuple format.
3. You can use content addresses to track which indexer outputs belong to each branch and/or version, to share common graph data.

This turns out to help quite a bit—but it doesn't completely solve the problem you've described: Simply "concatenating" new data to the serving tables isn't well-defined. That said: Depending on what queries you care about, that might or might not matter to you.

If you use a two-level implementation (base facts, plus facts deduced via Datalog rules), there's no need for denormalization; all you have to do is track the base facts by their source. When loading new facts, you delete only the existing facts from the same source that are being updated. This is a well-known technique in the Prolog implementation community; there are also implementations of "just-in-time indexing" that allow fast access to facts as they're updated. (The mechanism is more general: instead of replacing facts, you could do versioning. One example of a versioned graph database is TerminusDB, although I don't know enough about it to comment on whether it would make a good backend for a Kythe server; from what I know of it, it would be suitable and could also handle scaling.)
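The two-level idea can be sketched in Python rather than Prolog (the fact and edge names here are made up): base edges are tagged by the source they came from, and reverse references are derived at query time instead of being denormalized into the store, so an incremental update only replaces one source's facts.

```python
from collections import defaultdict


class FactStore:
    """Base facts tagged by source; derived facts computed on query."""

    def __init__(self):
        # source -> set of (subject, edge, object) base facts
        self.by_source = defaultdict(set)

    def load(self, source, facts):
        # Incremental update: replace only the facts that came
        # from this source, leaving other sources untouched.
        self.by_source[source] = set(facts)

    def refs_to(self, target):
        # Derived "view": reverse edges computed at query time,
        # so there is no denormalized table to rebuild.
        return sorted(s for facts in self.by_source.values()
                      for (s, edge, o) in facts
                      if edge == "ref" and o == target)
```

A real Datalog engine would index and memoize such derived relations; this sketch only shows why per-source replacement makes the update incremental.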

- p

程洋

Jun 8, 2021, 6:43:33 AM
to Kythe
I will take a deep look at it, but I have doubts.
AFAIK, the serving table is larger than 30TB for an AOSP project; a non-denormalized database may have big performance problems. But I will try.