I have an idea.
Since the entries produced by the indexer form a streaming protobuf file, the key metric for write_entries performance should be disk read speed. The entry files are sequential protobuf records, so once you have read 20% of the entries, you have finished 20% of the job.
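If disk read speed really is the ceiling, it is worth measuring the raw sequential throughput of the entries file as a baseline. A rough sketch (the file path is a placeholder, and dropping the page cache is Linux-specific):

```shell
# Drop the page cache so we measure the disk, not RAM.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Sequential read of the entries file; dd reports MB/s at the end.
dd if=/path/to/graph.entries of=/dev/null bs=4M status=progress
```

Any write_entries run that finishes close to this number is effectively I/O-bound and cannot be improved much on the write side.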
At first, I modified write_entries, because the length of WriteRequest.Update is frequently just 1 or 2, which means write_entries doesn't buffer anything and writes tiny payloads into LevelDB on every call.
So I made write_entries write 2GB of data to LevelDB at a time, which improved performance remarkably.
However, I later found that the problem is not write_entries itself: there are simply too many duplicate entries. (I generated 600GB of entries from the Linux kernel, and dedup_stream reduced them to a 6GB file.)
Using the dedup_stream tool also maximizes HDD read speed, and it performs significantly better than my modified write_entries.
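For reference, this is the pipeline I mean: dedup before writing, so LevelDB only ever sees each entry once. The --graphstore flag and the leveldb: prefix follow the Kythe docs; the paths are placeholders:

```shell
# Deduplicate the entry stream, then write the much smaller
# result into the LevelDB-backed graphstore.
dedup_stream < /path/to/graph.entries \
  | write_entries --graphstore leveldb:/path/to/graphstore
```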
So I think the right approach is to eliminate the duplicates at the source, i.e. merge the kzips.
However, if I merge all the kzips into one, I cannot run multiple indexers to utilize my multi-core machine.
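One workaround I can think of, instead of merging, is to keep the kzips separate, index them in parallel, and deduplicate the concatenated streams once at the end (delimited protobuf streams can be safely concatenated). A sketch, assuming GNU parallel is installed and with `my_indexer` standing in for whichever Kythe indexer binary is used:

```shell
# Index each kzip on its own core; each indexer writes its own
# delimited entry stream. {/.} is the kzip basename without extension.
ls compilations/*.kzip \
  | parallel -j"$(nproc)" 'my_indexer {} > entries/{/.}.entries'

# Concatenate all per-kzip streams and dedup once before writing.
cat entries/*.entries \
  | dedup_stream \
  | write_entries --graphstore leveldb:/path/to/graphstore
```

This keeps the indexing step parallel, but of course it still pays the cost of producing the duplicates before throwing them away, which is why I am asking the question below.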
Is there any way to make the indexer aware that the same file appears in different kzips, so that it avoids producing duplicate entries in the first place?