Questions about Kythe


zel...@gmail.com

Feb 4, 2016, 10:22:05 AM
to Kythe
Hi folks,

Is there a groups post or documentation page somewhere giving an overview of Kythe? I'm a Xoogler, particularly interested in the answers to the following questions:
  • How does Kythe compare to grok?
  • How does Kythe compare to codesearch?
  • How is Kythe related to grok and codesearch? Clearly it's not the same thing, because it looks like it only supports a couple of languages, whereas the internal tools understood many.
  • Is there a documentation page listing supported languages and their status? I see things like "and (soon) Go" scattered around, but I've no idea if that's up-to-date.
    • Kythe.io lists Indexers for C++ and Java, but an extractor for Go. What does it even mean to be able to extract for a language you can't index?
And the big one:

I'd like to set up something here at Square that allows you to browse all our source code: Protos, Java, Ruby, Go, ObjectiveC/Swift, Kotlin. I'd like all the types and symbols to be clickable and indexed. Basically, I want the Google-internal codesearch, but in a box. :-)  Is Kythe that? Or part of that? Or could it power that?

Thanks for your time, and sorry for the litany of questions: hopefully the fact that I as a recent Xoogler couldn't figure out the answers is useful information, and will lead to better docs.

Zellyn

Michael Fromberger

Feb 4, 2016, 10:48:43 AM
to zel...@gmail.com, Kythe
Hi, Zellyn,

Some answers inline:

On Thu, Feb 4, 2016 at 7:22 AM, <zel...@gmail.com> wrote:
Hi folks,

Is there a groups post or documentation page somewhere giving an overview of Kythe? I'm a Xoogler, particularly interested in the answers to the following questions:
  • How does Kythe compare to grok?

Kythe is basically "open-source Grok". The basic approach is the same—instrumenting compilers to extract useful metadata in graph form—but decoupled from specifics of the internal implementation. The main difference you'll notice is that we don't yet provide an indexing service for external code—what we are working on now is building the tools to support writing one. And, as you observed, we haven't yet got as many languages covered in Kythe.
  • How does Kythe compare to codesearch?

CodeSearch is a UI and search tool for fast regexp-matching of source code. At Google, it uses Grok to get cross-reference data. There isn't any public analogue of CodeSearch (or at least, not anymore). But we'd like Kythe to have the same relationship to other such tools, in that Kythe-compatible language tools will be able to generate similar cross-references. We (Kythe) hope to find an external tool similar enough to CodeSearch that we can do an integration with it, for the benefit of open-source projects. There are a number of more basic things we need to do first, though.
  • How is Kythe related to grok and codesearch? Clearly it's not the same thing, because it looks like it only supports a couple of languages, whereas the internal tools understood many.
For the CodeSearch part, see above. Kythe (like Grok) is basically a mechanism for sharing data about code in a language-agnostic way: things like cross-references, diagnostics, documentation, etc. Internally, as you noted, we have a bunch of language plugins that haven't yet been adapted to the open-source project. In some cases that's just because nobody's done it yet; in others, it's because the existing implementation is too bound up in internal details for us to re-use it. We'll be adding more languages as time allows, and in particular would welcome contributions in that area, since it's way more than our small team can keep up with.
 
  • Is there a documentation page listing supported languages and their status? I see things like "and (soon) Go" scattered around, but I've no idea if that's up-to-date.
There isn't a page specific to that. Perhaps we ought to write one. Briefly: C++ and Java are supported; I'm working on Go (and have been for a while—it's nearly ready but I keep getting derailed by other work). We're exploring other languages (e.g., JavaScript, TypeScript, Python) but don't yet have indexers for them that work outside Google's very particular build environment. We've talked with a few people who've expressed interest in contributing indexers for languages like Scala and PHP, but this is a very open area right now. :)
    • Kythe.io lists Indexers for C++ and Java, but an extractor for Go. What does it even mean to be able to extract for a language you can't index?
Providing accurate cross-references requires accurate dependency information. "Extraction" is what we call the process of capturing dependency information from the build process. "Indexing" is what we call running over the output of the extractors to generate Kythe graph data. 
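As a rough illustration of the split described above (not Kythe's actual data model or API), the Python sketch below separates the two steps. `CompilationUnit`, `extract`, `index`, the `definitions` table, and the file paths are all hypothetical names invented for this example; a real extractor hooks into the build tool, and a real indexer uses a compiler front end rather than a dictionary lookup. The point it tries to show is why dependency information matters: two files define `Parse`, and only the captured inputs say which one the build actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompilationUnit:
    """What a hypothetical extractor records for one compiler invocation."""
    source: str
    arguments: tuple
    required_inputs: frozenset  # every file the compiler read, dependencies included


def extract(source, arguments, files_read):
    """Extraction: capture the build's view of one compilation, nothing more."""
    return CompilationUnit(source, tuple(arguments), frozenset(files_read))


def index(unit, uses, definitions):
    """Indexing: turn a captured unit into (source, kind, target) graph edges.

    `uses` lists symbol names referenced by the source file. `definitions`
    maps a symbol name to every file in the repository that defines it.
    A reference is emitted only when exactly one defining file is among
    the inputs the build really used.
    """
    edges = []
    for symbol in uses:
        candidates = [path for path in definitions.get(symbol, [])
                      if path in unit.required_inputs]
        if len(candidates) == 1:
            edges.append((unit.source, "ref", f"{candidates[0]}#{symbol}"))
    return edges


# Two packages define `Parse`; the extracted inputs disambiguate them.
definitions = {"Parse": ["lib/a/parse.go", "lib/b/parse.go"]}
unit = extract("cmd/main.go", ["go", "build", "./cmd"],
               ["cmd/main.go", "lib/a/parse.go"])
edges = index(unit, ["Parse"], definitions)
print(edges)  # [('cmd/main.go', 'ref', 'lib/a/parse.go#Parse')]
```

Under this toy model, "an extractor without an indexer" simply means the first function exists for a language and the second does not yet: the compilations can be captured, but nothing turns them into graph data.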
 
And the big one:

I'd like to set up something here at Square that allows you to browse all our source code: Protos, Java, Ruby, Go, ObjectiveC/Swift, Kotlin. I'd like all the types and symbols to be clickable and indexed. Basically, I want the Google-internal codesearch, but in a box. :-)  Is Kythe that? Or part of that? Or could it power that?

Kythe (like Grok) is meant to be the latter—a mechanism that supports building such a tool. It's basically a common representation that lets you plug in various languages and share data with tools like CodeSearch, without having to rewrite everything each time you want to add a new tool. 

Internally, all those pieces are already "up and running". Kythe isn't to the point where it can be just "dropped in" to an existing codebase, though that is the eventual goal. We're working on unifying Grok with Kythe—i.e., fairly soon Kythe will be what we use internally as well. Our plan is to gather a collection of reusable components (indexers, build-tool integrations, editors, UI tools, etc.) that speak this common representation, so that what you're describing could more easily be done. In practice, "gather" implies we're looking for outside contributions as well as writing things ourselves. There's way more than any half-dozen people can keep up with. :)
 
Thanks for your time, and sorry for the litany of questions: hopefully the fact that I as a recent Xoogler couldn't figure out the answers is useful information, and will lead to better docs.

Your questions are welcome! Please let me know if my answers haven't addressed your points adequately.

Cheers,
–M

zel...@gmail.com

Feb 4, 2016, 11:04:27 AM
to Kythe, zel...@gmail.com
Thanks, Michael - your answers are everything I hoped for in terms of satisfying my curiosity, if not satisfying my desire to have it be finished already :-)

Zellyn

Dec 16, 2021, 9:42:16 AM
to Kythe
Follow-up questions. Every now and then I remember this project and wonder what it's up to!

Is Grok using Kythe yet? How does Kythe compare to the more recent work Sourcegraph and GitHub have been doing?

Thanks,

Zellyn
Message has been deleted

Zellyn Hunter

Dec 16, 2021, 11:49:23 AM
to M. J. Fromberger, Kythe
Hey there! Thanks for the info. I don't really have a horse in the race, other than wanting Kythe to succeed: seems like it would be a useful thing to exist in open source :-)

It sounds like someone could bang together a Zoekt+Kythe+some UI thing reasonably easily, given the existing components. Maybe someone will eventually get the urge!

Take care!

Zellyn


On Thu, Dec 16, 2021 at 10:02 AM M. J. Fromberger <michael.j....@gmail.com> wrote:
Hi, Zellyn,

Well, it's been a while, but the short answer to your first question is: Yes. I left Google in 2018, but before I did we managed to get the internal indexing stack at Google fully converted to Kythe.

Unfortunately, I was never able to sell upper management on the idea that Google should be doing this as a service, so despite that conversion it remains mostly a curiosity outside Google. The one place you can get Kythe results is on https://cs.opensource.google/, which is a kind of progress, I suppose. I have moved on to other things, and I do not expect Kythe to ever amount to much outside Google.

I don't know how GitHub's indexing works internally, so I can't really answer how it compares. I will say, though, that the quality and coverage of the existing GH cross-references are pretty poor. On that basis, I am pretty sure they are not taking the Kythe approach. Having said that, I saw that GitHub is working on a new search product that seems to be closer to CodeSearch. I signed up for the developer preview, but all I know right now is what's on the announcement. Based on that, I don't expect they are doing "real" semantic indexing. That said: Once you have a good CodeSearch, that's a logical next step, and they definitely are well-positioned for it.

Sourcegraph uses a Kythe-like approach, but most of their indexing is search-based. Individual customers can opt in to semantic indexing, which, at least as of the last time I looked, was implemented by custom indexers emitting data in LSIF format. (This is consistent with other parts of the Sourcegraph stack, which were historically heavily VSCode & Language Server based; even as they've evolved away from LS specifically, a lot of that history remains.) Having said all that, my knowledge of Sourcegraph isn't very current: I did a brief stint at the company in 2020, but haven't looked recently.

I'm sorry I don't have more satisfying answers for you. I miss Grok all the time, but I don't foresee having it in my workflow again anytime soon.

Kind regards,
–M

--
You received this message because you are subscribed to the Google Groups "Kythe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kythe+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kythe/dfbf4377-bfc3-41f0-ba02-d5d797f3ee87n%40googlegroups.com.

Shahms King

Dec 16, 2021, 11:58:54 AM
to Zellyn Hunter, M. J. Fromberger, Kythe
I think it's less the UI that's the issue and more build dependencies. As always, build systems are terrible and build systems for C++ factorially more so.  Since Kythe aims to provide a precise semantic graph, that generally requires accurately capturing dependencies and generated code, rather than working from just a syntax tree. Since build systems are frequently loosely specified, highly customizable or both, actually hooking into them to get the relevant information to produce an index is challenging and often requires manual specification and a lot of hacks.

--Shahms



--
:-P

Zellyn Hunter

Dec 16, 2021, 12:40:18 PM
to Shahms King, M. J. Fromberger, Kythe
Although getting it right in the general case is impossible, it seems like it would be tractable for many of the most frequent use cases, namely the default "normal" configuration for languages:
  • Java with Maven
  • Java with Gradle (maybe, if Gradle isn't programmable)
  • Java with Bazel (maybe)
  • Go with normal Go modules
  • Python with pip
  • Ruby with bundler
  • Node with npm
  • Rust with cargo
  • etc.

Zellyn

Daniel Moy

Dec 16, 2021, 12:47:11 PM
to Zellyn Hunter, Shahms King, M. J. Fromberger, Kythe
So we have done some experimentation there.  You can get some percentage of repos to index auto-magically.  The problem is that it tends to not just be "Java with Maven", it's "Java with Maven, but then you have to hold it in this particular way", and then it starts to fall apart once you tack on Kythe's extraction bits, without a lot of handholding.

We ended up being able to successfully index only a small percentage of repos without a lot of manual tweaking.

Bazel tends to work better.

But yes, we had roughly your exact same line of thought (in fact, three separate people had this same idea, over a period of 4-5 years, and we all tried, and we all failed to make fast & good enough progress).

Daniel Moy | Google Software Engineer

Zellyn Hunter

Dec 16, 2021, 12:58:30 PM
to Daniel Moy, Shahms King, M. J. Fromberger, Kythe
Heh. I assume at least three people have had the idea of working with the VSCode folks, Bazel folks, IntelliJ folks to create a standard dependency information file format that all the build systems could learn to output?

Zellyn

Daniel Moy

Dec 16, 2021, 1:02:29 PM
to Zellyn Hunter, Shahms King, M. J. Fromberger, Kythe
Bazel is easier for us to work with (since it did originate from Google).  The others, I don't think we have tried to reach out for any standardization.

There's also the issue of xkcd/927

Daniel Moy | Google Software Engineer

Shahms King

Dec 16, 2021, 1:38:49 PM
to Zellyn Hunter, Daniel Moy, M. J. Fromberger, Kythe
Most build systems aren't hermetic and don't precisely track dependencies to begin with.  In some cases, the language-specific extractor is precise enough that all we really need are the compiler commands and then https://clang.llvm.org/docs/JSONCompilationDatabase.html can work.
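For readers unfamiliar with the format linked above, here is a minimal Python sketch of reading a JSON compilation database. The format is a JSON array of entries with `directory`, `file`, and either `command` (a single shell-escaped string) or `arguments` (an argv list). The helper name `load_compile_commands`, the sample paths, and the use of `shlex.split` as an approximation of shell unescaping are assumptions made for illustration; this is not how any Kythe extractor is actually implemented.

```python
import json
import shlex


def load_compile_commands(text):
    """Parse a compile_commands.json document into a list of simple dicts.

    Each result has the working directory, the main source file, and the
    compiler invocation normalized to an argv list.
    """
    units = []
    for entry in json.loads(text):
        if "arguments" in entry:
            argv = list(entry["arguments"])
        else:
            # `command` is a shell-escaped string; shlex is an approximation.
            argv = shlex.split(entry["command"])
        units.append({
            "directory": entry["directory"],
            "file": entry["file"],
            "argv": argv,
        })
    return units


# Hypothetical two-entry database, one entry per supported spelling.
SAMPLE = """
[
  {"directory": "/src/build",
   "file": "../lib/parse.cc",
   "command": "clang++ -Iinclude -DNDEBUG -c ../lib/parse.cc -o parse.o"},
  {"directory": "/src/build",
   "file": "../app/main.cc",
   "arguments": ["clang++", "-Iinclude", "-c", "../app/main.cc"]}
]
"""

units = load_compile_commands(SAMPLE)
for unit in units:
    # Include directories are one piece of dependency information that the
    # command line carries explicitly.
    include_dirs = [arg[2:] for arg in unit["argv"]
                    if arg.startswith("-I") and len(arg) > 2]
    print(unit["file"], include_dirs)
```

A database like this records what the compiler was asked to do, not which files it ended up reading, which is presumably why it only suffices when the language-specific extractor can work out the rest on its own.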

--Shahms
--
:-P

Christian

Dec 16, 2021, 2:00:40 PM
to M. J. Fromberger, Zellyn, Kythe
It is sad that Kythe would remain a curiosity outside of Google.
I think there is a great venture to be started by using Kythe to index source code and provide some sort of codesearch clone to any other company.

(In fact, Sourcegraph provides just that, and they were a unicorn last time I checked. Since they basically have no competitors, it's not that great. :))





Robin Palotai

Dec 16, 2021, 2:14:09 PM
to Shahms King, Zellyn Hunter, Daniel Moy, M. J. Fromberger, Kythe
Hi Zellyn & guys,

Xoogler here too, I'm quite fond of searching code, and can relate to what was shared in this thread. I'm also very enthusiastic about Kythe's precision (see some jotted comparison with LSIF), but encounter the following problems in practice:
- Complexity of tapping into a build pipeline - as Shahms said, there's always some customization needed. Java using Lombok? Need to convince Kythe to apply the same class preprocessor plugin. C++ that doesn't build nicely with clang? Bad luck.
- Cross-language linking - details are foggy, but let's just say protobuf to C++ linking doesn't work out of the box.
- Kythe postprocessing pipeline - the standalone pipeline doesn't emit all the info, while the Beam-based pipeline is in Golang, which (last time I checked) was not a first-class Beam citizen, and I couldn't get it running with Flink (or GCP). That left no way to postprocess medium-sized indices into a serving table.

So while precise linking would be ideal, I sometimes think that having fast local parsing + good-enough heuristics (or external support) to decide when there's linking ambiguity would be more productive. I don't know the details of GitHub's internals, but at least their parser is open-source (semantic, written in Haskell, uses tree-sitter grammars). The parser gives you local symbols, but then the internal magic is in inferring nonlocal linking, I guess. FB's Glean is pretty promising too, though there aren't many indexers for it yet.

As for Zoekt+some UI, I wrote https://github.com/TreeTide/zoekt-underhood which exposes a zoekt index through an API consumable by https://github.com/TreeTide/underhood/tree/robinp-uispeed (tip of that branch broke the Kythe gateway compatibility temporarily to work with zoekt-underhood... would be nice to reconcile eventually, but time..). It navigates using text search - if you click (with various click modifiers) on a piece of text, it brings backrefs from Zoekt's text search. Less ideal than true crossrefs, but better than ping-ponging between Zoekt results and GitHub. Not production-ready though, would need to handle different indexed branches, zoekt index changing under the UI, etc.
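To make the "backrefs from text search" idea concrete, here is a small self-contained Python sketch. It does not use Zoekt or the underhood API; the function `text_backrefs`, the in-memory `files` dictionary, and the sample file contents are hypothetical stand-ins, and the clicked token is assumed to be identifier-like so that word boundaries behave sensibly.

```python
import re


def text_backrefs(files, token):
    """Return (path, line number, column) for each whole-word hit of `token`.

    `files` maps a path to its text. This is plain text search: it cannot
    tell a real reference from a same-named symbol or a word in a comment.
    """
    pattern = re.compile(r"\b" + re.escape(token) + r"\b")
    hits = []
    for path in sorted(files):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.append((path, lineno, match.start()))
    return hits


# Tiny hypothetical corpus standing in for an indexed repository.
files = {
    "a.py": "def parse(x):\n    return x\n",
    "b.py": "from a import parse\nparse(1)\nparse_all()\n",
}

hits = text_backrefs(files, "parse")
print(hits)  # [('a.py', 1, 4), ('b.py', 1, 14), ('b.py', 2, 0)]
```

The whole-word match skips `parse_all`, but it would happily report an unrelated `parse` in another module, which is the gap between this kind of navigation and true cross-references.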

And then the real world is messy. There's tons of domain-specific yaml, json (also throw in jinja templates, helm and jsonnet to the mix). You have no chance to write Kythe indexers for all of these, so some flexibility in the tooling is needed, again probably along with some good (user-definable?) heuristics.

But it is great to see a lot of development in the area. I'm hoping for a better crosslinked future.

Robin
