Is it possible to do incresemental update?

程洋

unread,

Mar 22, 2022, 12:09:46 AM3/22/22

to Kythe

For large repositories like AOSP. if we index it would take days to complete.

But current run_extractor always do full compile.

Is there anyway to do only incresemental update for the database if source code just change a little bit

Shahms King

unread,

Mar 22, 2022, 11:34:29 AM3/22/22

to 程洋, Kythe

Incremental updates are not presently possible. Generally, the way to manage indexing of large repositories is by doing so in parallel. This is a big reason why the compilation units produced from extraction are self-contained.

--Shahms

--
You received this message because you are subscribed to the Google Groups "Kythe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kythe+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kythe/f5644b9e-3c89-4009-96c3-7990f5b3b12an%40googlegroups.com.

--

:-P

michael.j....@gmail.com

unread,

Mar 22, 2022, 12:25:27 PM3/22/22

to Kythe

As Shahms points out, there's not a straightforward way to update the graph incrementally. That said: In some cases it may be possible to cache previous indexer outputs, to avoid having to completely reindex the project for every change. While you'd still have to run a build to determine which compilation units have changed, you may be able to cache the indexer outputs from previous builds of compilation units that do not change (e.g., none of their files or dependencies have been updated). That can save a fair bit of overhead for indexing and generating the underlying forward graph. Just be aware that "cacheability" of indexer outputs depends on the language to some extent.

Even with caching, updating denormalized serving databases will still require processing all the indexer outputs for a given revision (whether cached or not), as the effects of even small changes can affect arbitrarily-many elements of the reverse dependency relationship.

–M

程洋

unread,

Mar 23, 2022, 11:03:11 PM3/23/22

to Kythe

That's what i suppose.

So the index ouput is self-contailed, and i just need to recompile the changed source file (like *.c), and the merge to output into previous database?

michael.j....@gmail.com

unread,

Mar 24, 2022, 9:58:16 AM3/24/22

to Kythe

For C, unfortunately, the situation is made more complex by the problem of claiming. Since header files may participate in many compilations, indexing them directly would result in a lot of duplicate indexer output. To work around this, the C++ indexer uses a dynamic "claiming" mechanism to ensure the header is only indexed once for each of its variations (e.g., how it is expanded under the preprocessor in different files). The claiming mechanism makes the outputs not directly cacheable, since the claim assignment is not easily recoverable after the fact.

For languages without this issue, for example Java and Go, indexer outputs can be cached and reused.

Updating the graph takes a bit more than simply "adding in" the updated data, since the new data may have removed entries that were previously there. Also, a simple change to one compilation unit may require large changes to the rest of the graph, so updating it completely generally requires scanning the whole graph. (Consider, for example, a change that adds a new overload that will be preferred by calls from existing code that has not otherwise changed).

So: Caching doesn't eliminate the need to rebuild the graph, but it can (where applicable) reduce the amount of indexing you have to do for a build.

–M

peter.l...@gmail.com

unread,

Mar 24, 2022, 11:17:47 AM3/24/22

to Kythe

On Tuesday, March 22, 2022 at 9:25:27 AM UTC-7 michael.j....@gmail.com wrote:

As Shahms points out, there's not a straightforward way to update the graph incrementally. That said: In some cases it may be possible to cache previous indexer outputs, to avoid having to completely reindex the project for every change. While you'd still have to run a build to determine which compilation units have changed, you may be able to cache the indexer outputs from previous builds of compilation units that do not change (e.g., none of their files or dependencies have been updated). That can save a fair bit of overhead for indexing and generating the underlying forward graph. Just be aware that "cacheability" of indexer outputs depends on the language to some extent.

Even with caching, updating denormalized serving databases will still require processing all the indexer outputs for a given revision (whether cached or not), as the effects of even small changes can affect arbitrarily-many elements of the reverse dependency relationship.

If you used something like datalog rules instead of pre-computing the denormalized database, then incremental updates would be feasible. Updating the base facts is a solved problem - that's how Prolog systems do "reconsult" (basically, every fact is labeled with its source, so reloading a data source first deletes all the facts that have its label).