Indexing large projects


tyle...@gmail.com

Jun 10, 2020, 2:15:30 PM
to Kythe
I have been attempting to use Kythe to index a large source tree. As an example I am trying this out on Google's v8 javascript engine.

First off, I had a little bit of difficulty setting up the extraction. For example, Kythe did not understand what to do with compiler commands that included response files with the @./response_file syntax, and would simply error out. My understanding is that projects like v8 and chromium are set up for indexing/extraction with Kythe--which is part of why I was using v8 as a test--but I couldn't find any information on that. It might be a helpful piece of documentation for those trying to integrate Kythe with similarly large-scale projects. I've mostly been following https://kythe.io/examples to try to understand how to do things, but it seems likely I've been messing something up.

Eventually I got extraction working and ended up with about 2GB of kzip files, which seems pretty reasonable. I then merged these with the merge tool, which ended up producing a shockingly small ~90MB kzip file.
Now when I go to run /opt/kythe/indexers/cxx_indexer to produce... well, I'm not exactly sure what yet. After running out of free disk space the first time, I restarted cxx_indexer and, rather than writing to disk, piped its output directly to /opt/kythe/tools/write_entries, with pv in between them so I get a little bit of introspection.
It's been running for over 20 hours now, and has output about 440GB(!) of data to write_entries, which in turn so far has produced a pretty reasonable 4GB database.

So I guess my questions are:
1) In general, how does one run Kythe on large scale projects? Is there some clean/easy interface I have missed? Even if it requires work to integrate for other projects, just seeing how Kythe integrates with v8 or chromium as an example would be useful.
2) What should I be expecting from cxx_indexer? How much data should I expect it to output from a large extraction? Are several hundred gigabytes normal?
3) How long should one expect Kythe to take to index a large-scale project? Are dozens of hours normal? Is there some option, setting, or alternate way to use the data that will make it faster?

I appreciate Kythe's interface and accuracy compared to other indexing tools so I would love to use it. I assume I messed something up on my end that is causing it to be so slow, but am not sure what it could be.

Cheers,
  Tyler

Evan Martin

Jun 10, 2020, 2:24:12 PM
to tyle...@gmail.com, Kythe
This doesn't answer your actual question, but just in case it's helpful: Chrome's code search has the v8 code xref'ed via Kythe.

--
You received this message because you are subscribed to the Google Groups "Kythe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kythe+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kythe/88effce7-b8f6-41d3-82ea-ba5d208f993do%40googlegroups.com.

Tyler Nighswander

Jun 10, 2020, 2:29:27 PM
to Evan Martin, Kythe
Thanks!
I've seen (and really liked!) the code search Google has available publicly. That is part of why I assume there must be a way to extract and index the code efficiently on large projects and I must be doing something wrong, since clearly that use case is well supported :)

Cheers,
  Tyler

Shahms King

Jun 10, 2020, 2:43:37 PM
to Evan Martin, tyle...@gmail.com, Kythe
On Wed, Jun 10, 2020 at 11:24 AM 'Evan Martin' via Kythe <ky...@googlegroups.com> wrote:
This doesn't answer your actual question, but just in case it's helpful: Chrome's code search has the v8 code xref'ed via Kythe.

On Wed, Jun 10, 2020 at 11:15 AM <tyle...@gmail.com> wrote:
I have been attempting to use Kythe to index a large source tree. As an example I am trying this out on Google's v8 javascript engine.

First off, I had a little bit of difficulty setting up the extraction. For example, Kythe did not understand what to do with compiler commands that included response files with the @./response_file syntax, and would simply error out. My understanding is that projects like v8 and chromium are set up for indexing/extraction with Kythe--which is part of why I was using v8 as a test--but I couldn't find any information on that. It might be a helpful piece of documentation for those trying to integrate Kythe with similarly large-scale projects. I've mostly been following https://kythe.io/examples to try to understand how to do things, but it seems likely I've been messing something up.

Whether or not @-style parameter files are supported (and which syntax -- there is no single standard for such files) depends a great deal on which extractor you're using.  Which extractor were you having issues with?
 

Eventually I got extraction working and ended up with about 2GB of kzip files, which seems pretty reasonable. I then merged these with the merge tool, which ended up producing a shockingly small ~90MB kzip file.
Now when I go to run /opt/kythe/indexers/cxx_indexer to produce... well, I'm not exactly sure what yet. After running out of free disk space the first time, I restarted cxx_indexer and, rather than writing to disk, piped its output directly to /opt/kythe/tools/write_entries, with pv in between them so I get a little bit of introspection.
It's been running for over 20 hours now, and has output about 440GB(!) of data to write_entries, which in turn so far has produced a pretty reasonable 4GB database.

2GB -> 90MB seems pretty reasonable.  The bulk of the individual .kzip files are generally the dependencies, and those are frequently heavily duplicated between compilation units, so merging them reduces the size dramatically.

C++ takes a long time to index and generally produces a *lot* of duplication for a variety of reasons, but mostly templates.  Asking the C++ indexer to index a single large kzip with many compilations will serialize indexing and be very slow.  You can speed things up by invoking the indexer in parallel on the individual compilation units (rather than merging them) and combining the output.  Additionally, there are a variety of flags on the C++ indexer itself which can help dramatically, the biggest being claiming (particularly --cache and --experimental_dynamic_claim_cache) and template-related flags (especially --experimental_alias_template_instantiations).  Some of these flags do reduce or simplify the output graph somewhat, though.

Unfortunately, the flags themselves are a bit scattered around the code.
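For what it's worth, the parallel-indexing setup described above might look roughly like this (a sketch only; the binary paths, the memcached address, and the exact flag spellings are assumptions pieced together from this thread, not verified against a particular Kythe release):

```shell
# Index each extracted compilation unit in parallel (instead of one merged
# kzip), with memcached-backed claiming, then dedup and write the results.
find kzips/ -name '*.kzip' | sort \
  | parallel -j"$(nproc)" \
      /opt/kythe/indexers/cxx_indexer \
        --cache="--SERVER=localhost:11211" \
        --experimental_dynamic_claim_cache="--SERVER=localhost:11211" \
        --experimental_alias_template_instantiations \
        {} \
  | /opt/kythe/tools/dedup_stream \
  | /opt/kythe/tools/write_entries -graphstore /mnt/data/graphstore
```

Dynamic claiming lets the concurrent indexer processes agree on who gets to index a shared header or template instantiation, which is what cuts down the duplicated output.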
 

So I guess my questions are:
1) In general, how does one run Kythe on large scale projects? Is there some clean/easy interface I have missed? Even if it requires work to integrate for other projects, just seeing how Kythe integrates with v8 or chromium as an example would be useful.

See above (I'm not sure what I'm allowed to discuss about the internal deployment, so apologies for being vague).
 
2) What should I be expecting from cxx_indexer? How much data should I expect it to output from a large extraction? Are several hundred gigabytes normal?

Yes, especially if you aren't using claiming and are exhaustively indexing templates.
 
3) How long should one expect Kythe to take to index a large-scale project? Are dozens of hours normal? Is there some option, setting, or alternate way to use the data that will make it faster?

There are many things that can help, but the biggest are likely to be claiming and parallelization.

--Shahms
 

I appreciate Kythe's interface and accuracy compared to other indexing tools so I would love to use it. I assume I messed something up on my end that is causing it to be so slow, but am not sure what it could be.

Cheers,
  Tyler


Robin Palotai

Jun 10, 2020, 2:51:26 PM
to Tyler Nighswander, Evan Martin, Kythe
Hi Tyler!

I'm also "struggling" with C++, in that I'm trying to manage an extraction. Let me share my commands so far (many inspired by kythe/release/kythe.sh in the repo). I didn't know about the merge tool; I will try it (maybe I wouldn't need the memcached-based claiming?)

Extraction:
- Run bazel with "--output_base=/mnt/data/kythe/output_base", so your home directory doesn't run out of disk space. Also, your cache won't get trashed if you change compiler options (if you keep the output_base separate... maybe).
- Also with "--bazelrc=/opt/kythe/extractors.bazelrc", maybe after commenting out mistyped extractor names and disabling the proto toolchains (which don't seem to work for me even after fiddling, but maybe it is just me)

- find /mnt/data/kythe/output_base/execroot/io_kythe/bazel-out/k8-opt/extra_actions/ -name '*.cxx.kzip' | sort  > units
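Put together, the extraction steps above might be driven like this (a sketch; the `//...` target pattern and the local paths are assumptions based on Robin's layout):

```shell
# Extract compilation units via the Bazel extra actions configured in
# extractors.bazelrc, keeping all Bazel state on a big disk.
bazel --output_base=/mnt/data/kythe/output_base \
      --bazelrc=/opt/kythe/extractors.bazelrc \
      build --keep_going //...

# Collect the resulting kzips into a stable, sorted work list.
find /mnt/data/kythe/output_base/execroot/io_kythe/bazel-out/k8-opt/extra_actions/ \
  -name '*.cxx.kzip' | sort > units
```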

Indexing:
- start "memcached" after apt-getting it.
- time cat units \
    | parallel --gnu --tmpdir /mnt/data/tmp -L1 \
        ./bazel-bin/kythe/cxx/indexer/cxx/indexer \
          --ignore_unimplemented \
          --experimental_index_lite \
          --experimental_dynamic_claim_cache="--SERVER=localhost:11211" \
          -cache="--SERVER=localhost:11211" \
          -cache_stats \
    | ./bazel-bin/kythe/go/platform/tools/dedup_stream/dedup_stream \
    | ./bazel-bin/kythe/go/storage/tools/write_entries/write_entries \
        -workers 12 -graphstore /mnt/data/kythe/kythe_repo/gs2

Serving tables:
- You will struggle. The Beam pipeline may work in GCE, but not (efficiently) locally, and not yet on Flink. The legacy pipeline will still be somewhat slow (but at least it uses concurrent disk sorting? Maybe I'm wrong.)
- I'm working on some code to have an alternative serving representation as a hobby. Others in the list mentioned having other improvements in the pipes.
- UI. Well, there's no good UI. My own project Underhood (https://github.com/TreeTide/underhood) is getting a facelift in the robinp-uispeed branch, but I wouldn't advertise it as production-ready (or anything-ready). We'll see if the alternative serving representation helps with iterating on it.

But quail not! Kythe is too good to give up on. I'm sure something will work in the end.

Robin

Shahms King

Jun 10, 2020, 2:57:46 PM
to Robin Palotai, Tyler Nighswander, Evan Martin, Kythe
On Wed, Jun 10, 2020 at 11:51 AM Robin Palotai <palota...@gmail.com> wrote:
Hi Tyler!

I'm also "struggling" with C++, in that I try to manage an extraction. Let me share my commands so far (many inspired by kythe/release/kythe.sh in the repo). I didn't know about the merge tool, will try it (maybe I wouldn't need the memcached based claiming?)

Thanks, Robin!

You'll definitely still need claiming; that primarily helps with templates and headers.

--Shahms
 

tyle...@gmail.com

Jun 10, 2020, 3:04:58 PM
to Kythe


On Wednesday, June 10, 2020 at 11:43:37 AM UTC-7, Shahms King wrote:


On Wed, Jun 10, 2020 at 11:24 AM 'Evan Martin' via Kythe <ky...@googlegroups.com> wrote:
This doesn't answer your actual question, but just in case it's helpful: Chrome's code search has the v8 code xref'ed via Kythe.

On Wed, Jun 10, 2020 at 11:15 AM <tyle...@gmail.com> wrote:
I have been attempting to use Kythe to index a large source tree. As an example I am trying this out on Google's v8 javascript engine.

First off, I had a little bit of difficulty setting up the extraction. For example, Kythe did not understand what to do with compiler commands that included response files with the @./response_file syntax, and would simply error out. My understanding is that projects like v8 and chromium are set up for indexing/extraction with Kythe--which is part of why I was using v8 as a test--but I couldn't find any information on that. It might be a helpful piece of documentation for those trying to integrate Kythe with similarly large-scale projects. I've mostly been following https://kythe.io/examples to try to understand how to do things, but it seems likely I've been messing something up.

Whether or not @-style parameter files are supported (and which syntax -- there is no single standard for such files) depends a great deal on which extractor you're using.  Which extractor were you having issues with?

I was using the cxx_extractor to do this. I had some issues using compile_commands.json and ended up wrapping the clang that is packaged with v8 in a shell script that also ran cxx_extractor.
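A wrapper along those lines might look like this (hypothetical; the KYTHE_* environment variables are the ones cxx_extractor reads, but the output directory, corpus name, and the real compiler location are placeholders for illustration):

```shell
#!/bin/sh
# Hypothetical wrapper: v8's build invokes this in place of its clang++.
# First run the Kythe extractor on the exact same command line...
KYTHE_ROOT_DIRECTORY="$PWD" \
KYTHE_OUTPUT_DIRECTORY=/mnt/data/kythe/kzips \
KYTHE_CORPUS=v8 \
  /opt/kythe/extractors/cxx_extractor "$@"

# ...then exec the real compiler so the build itself still succeeds.
exec /path/to/v8/bundled/clang++ "$@"
```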
 
 

Eventually I got extraction working and ended up with about 2GB of kzip files, which seems pretty reasonable. I then merged these with the merge tool, which ended up producing a shockingly small ~90MB kzip file.
Now when I go to run /opt/kythe/indexers/cxx_indexer to produce... well, I'm not exactly sure what yet. After running out of free disk space the first time, I restarted cxx_indexer and, rather than writing to disk, piped its output directly to /opt/kythe/tools/write_entries, with pv in between them so I get a little bit of introspection.
It's been running for over 20 hours now, and has output about 440GB(!) of data to write_entries, which in turn so far has produced a pretty reasonable 4GB database.

2GB -> 90MB seems pretty reasonable.  The bulk of the individual .kzip files are generally the dependencies and those are frequently heavily duplicated between compilation units.  As a result, merging them will reduce the size dramatically.  C++ takes a long time to index and generally results in a *lot* of duplication for a variety of reasons, but mostly templates.  Asking the C++ indexer to index a single large kzip with many compilations will serialize indexing and be very slow.  You can speed things up by invoking the indexer in parallel on the individual compilation units (rather than merging them) and combining the output.  Additionally, there are a variety of flags available on the C++ indexer itself which can help dramatically, the biggest being claiming (particularly --cache and --experimental_dynamic_claim_cache) and template (especially experimental_alias_template_instantiations) related flags.  Some of these flags do reduce or simplify the output graph somewhat, though.

Unfortunately, the flags themselves are a bit scattered around the code.
 
 
Oh awesome! I just ran with --help and didn't see any flags (no worries -- I know stuff is still in progress); those sound super helpful! Thanks for the tips!


So I guess my questions are:
1) In general, how does one run Kythe on large scale projects? Is there some clean/easy interface I have missed? Even if it requires work to integrate for other projects, just seeing how Kythe integrates with v8 or chromium as an example would be useful.

See above (I'm not sure what I'm allowed to discuss about the internal deployment, so apologies for being vague).
 
2) What should I be expecting from cxx_indexer? How much data should I expect it to output from a large extraction? Are several hundred gigabytes normal?

Yes, especially if you aren't using claiming and are exhaustively indexing templates.
 
3) How long should one expect Kythe to take to index a large-scale project? Are dozens of hours normal? Is there some option, setting, or alternate way to use the data that will make it faster?

There are many that can help, but the biggest are likely to be claiming and parallelization.

--Shahms
 

I appreciate Kythe's interface and accuracy compared to other indexing tools so I would love to use it. I assume I messed something up on my end that is causing it to be so slow, but am not sure what it could be.

Cheers,
  Tyler



 Thanks for the quick response and helpful tips!

tyle...@gmail.com

Jun 10, 2020, 3:13:49 PM
to Kythe


On Wednesday, June 10, 2020 at 11:51:26 AM UTC-7, Robin Palotai wrote:
Hi Tyler!

I'm also "struggling" with C++, in that I try to manage an extraction. Let me share my commands so far (many inspired by kythe/release/kythe.sh in the repo). I didn't know about the merge tool, will try it (maybe I wouldn't need the memcached based claiming?)

Extraction:
- Run bazel with "-output_base /mnt/data/kythe/output_base". So your home doesn't run out of disk space. Also, your cache won't get trashed if you change compiler options (if you keep the output_base separate... maybe).
- Also with "--bazelrc=/opt/kythe/extractors.bazelrc", maybe after commenting out mistyped extractor names and disabling the proto toolchains (which don't seem to work for me even after fiddling, but maybe it is just me)

- find /mnt/data/kythe/output_base/execroot/io_kythe/bazel-out/k8-opt/extra_actions/ -name '*.cxx.kzip' | sort  > units

Indexing:
- start "memcached" after apt-getting it.
- time cat units | parallel --gnu --tmpdir /mnt/data/tmp -L1 ./bazel-bin/kythe/cxx/indexer/cxx/indexer --ignore_unimplemented --experimental_index_lite --experimental_dynamic_claim_cache="--SERVER=localhost:11211" -cache="--SERVER=localhost:11211" -cache_stats | ./bazel-bin/kythe/go/platform/tools/dedup_stream/dedup_stream | ./bazel-bin/kythe/go/storage/tools/write_entries/write_entries -workers 12 -graphstore /mnt/data/kythe/kythe_repo/gs2


This looks great, thanks for the detailed directions on how to use the caching, that's super helpful. I will try this out.
 
Serving tables:
- You will struggle. The Beam pipeline may work in GCE, but not (efficiently) locally, and not yet on Flink. The legacy pipeline will still be somewhat slow (but at least it uses concurrent disk sorting? Maybe I'm wrong.)
- I'm working on some code to have an alternative serving representation as a hobby. Others in the list mentioned having other improvements in the pipes.
- UI. Well, there's no good UI. My own project Underhood (https://github.com/TreeTide/underhood) is going over some facelift in the robinp-uispeed branch, but I wouldn't advertise it as production-ready (or anything-ready). Will see if the alternative serving representation would help on iterating it.


Good to know. I poked a bit at Underhood and it looks nice: I'm glad there is some development going on for an alternate UI! I am still not sure where I'm going or if Kythe will take me there, and if that will involve trying to create yet another frontend for Kythe or not...
 