Limiting Kythe extraction/indexing output

Filip Szewczyk

Apr 4, 2024, 6:26:01 AM

to Kythe

Hi,

I'm trying to use Kythe on a large C++ project with multiple 3rd party libraries, and I've run into a few issues with the size and contents of the Kythe output.

I'm running extraction using

${KYTHE_DIR}/tools/runextractor cmake -extractor=${KYTHE_DIR}/extractors/cxx_extractor -sourcedir=${CMAKE_PROJECT_DIR}

This produces a large but reasonable output of ~3300 kzip files that sum up to 9.1 GB, which is in line with the number of cpp files in the project. Execution time is also relatively short, especially compared to indexing.

I've got some problems when it comes to indexing. Running:

${KYTHE_DIR}/indexers/cxx_indexer --ignore_unimplemented -- ${KYTHE_OUTPUT_DIRECTORY}/*.kzip >> entries

produces a massive output of over 3 TB, most of it duplicates that I tried removing later (either with a custom script or using /tools/entrystream).

Running indexing on the kzips in batches (e.g. indexing 100 files at once and removing duplicates within each batch) helps in managing the disk space the indexing consumes, but the process is painfully slow.
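A rough sketch of that batching approach, for reference. It is not the exact script used: the function name `index_in_batches`, the `BATCH_SIZE` variable, and the choice of `${KYTHE_DIR}/tools/dedup_stream` as the deduplication step are assumptions for illustration. Deduplicating per batch only removes duplicates within a batch, so entries repeated across batches remain in the output.

```shell
# Sketch only: index kzips in fixed-size batches and deduplicate each batch.
# INDEXER, DEDUP and BATCH_SIZE are illustrative variables; DEDUP is assumed
# to read an entry stream on stdin and write a deduplicated stream to stdout.
INDEXER="${INDEXER:-${KYTHE_DIR}/indexers/cxx_indexer}"
DEDUP="${DEDUP:-${KYTHE_DIR}/tools/dedup_stream}"
BATCH_SIZE="${BATCH_SIZE:-100}"

index_in_batches() {
  # $1: directory containing .kzip files; $2: output entries file
  kzip_dir="$1"
  out="$2"
  : > "$out"                 # start with an empty output file
  set --                     # reuse positional parameters as the current batch
  for kzip in "$kzip_dir"/*.kzip; do
    [ -e "$kzip" ] || continue
    set -- "$@" "$kzip"
    if [ "$#" -ge "$BATCH_SIZE" ]; then
      # Same indexer invocation as above, restricted to one batch.
      "$INDEXER" --ignore_unimplemented -- "$@" | "$DEDUP" >> "$out"
      set --
    fi
  done
  # Flush the final, possibly smaller, batch.
  if [ "$#" -gt 0 ]; then
    "$INDEXER" --ignore_unimplemented -- "$@" | "$DEDUP" >> "$out"
  fi
}
```

It would be called as `index_in_batches "${KYTHE_OUTPUT_DIRECTORY}" entries`.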

To help with the size of the output I have a few questions:

1. Is there a way to exclude specific paths/directories from extraction/indexing? Most of the entries come from 3rd party libraries that live in one directory, so excluding those from the output would help, as I don't need a graph for them.

2. I found some caching params in the cxx_indexer documentation, but I'm struggling to find details on how to use them and how they actually work. Could caching work for this specific issue? Can you point me to any examples of how to use it?

3. Is there a way to exclude specific nodes or edges from the output?

Please let me know if you have any other suggestions to speed up the process and/or use fewer resources.

Regards,

Filip Szewczyk

michael.j....@gmail.com

Apr 5, 2024, 9:42:05 PM

to Kythe

Hi, Filip,

Generally for C++ you will need to enable some sort of claiming to deal with the inherent redundancy across many related translation units. Probably the easiest way to handle that is to use the --experimental_dynamic_claim_cache flag and run a memcache instance alongside the indexer. This isn't perfect, but it should ordinarily (and substantially) reduce the amount of duplicate work your indexers do on shared header files. Even then, there will be some duplication you'll want to fold out (you can pipe the output through the dedup_stream program for this). The main drawback to using cache-based ("dynamic") claiming is that you can't safely cache the indexer outputs from previously-indexed compilation units, as the claim decisions are not deterministic.
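A minimal sketch of how that might look on the command line. The function name `index_with_claim_cache` and the `CLAIM_CACHE_SERVER` variable are illustrative, the memcache instance is assumed to be listening on `localhost:11211`, and the `--SERVER=host:port` value format for the flag is an assumption that should be checked against the output of `cxx_indexer --help`.

```shell
# Sketch only: index with dynamic claiming and fold out remaining duplicates.
# INDEXER, DEDUP and CLAIM_CACHE_SERVER are illustrative variables.
INDEXER="${INDEXER:-${KYTHE_DIR}/indexers/cxx_indexer}"
DEDUP="${DEDUP:-${KYTHE_DIR}/tools/dedup_stream}"
CLAIM_CACHE_SERVER="${CLAIM_CACHE_SERVER:-localhost:11211}"

index_with_claim_cache() {
  # "$@": kzip files to index; deduplicated entries are written to stdout.
  # Assumption: the flag takes a server list of the form "--SERVER=host:port".
  "$INDEXER" --ignore_unimplemented \
    --experimental_dynamic_claim_cache="--SERVER=${CLAIM_CACHE_SERVER}" \
    -- "$@" | "$DEDUP"
}
```

With a memcache instance already running (for example, started with `memcached -d -p 11211`), this could be invoked as `index_with_claim_cache "${KYTHE_OUTPUT_DIRECTORY}"/*.kzip >> entries`. Because claim decisions depend on the cache state, the cache should start empty for each full indexing run.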