Hi,
here is a request for comments on an informal proposal for a new feature for the bblfsh project.
Below I'll try to describe the problem context and a possible solution.
All prior discussion on this subject is limited mostly to [9].
Problem:
1. the language names that enry uses differ from the names that bblfsh drivers use
e.g. bash-driver parses a number of *sh languages, while linguist has "shell" (which groups zsh, bash and sh)
the same goes for csharp-driver vs. C# in linguist, and cpp-driver vs. C++ in linguist
2. a single driver may already support multiple "languages" (to a different extent)
e.g. javascript-driver parses JSX+JS+Flow (but not JSON), and cpp-driver parses both C and C++
Out of scope
This proposal intentionally does not touch on the more general problem of supporting "mixed script" documents,
like HTML+JS in a single document, inside the same driver.
Context
Language names in Enry
come from GitHub Linguist and are re-used directly;
any change on this side would complicate maintenance
Language names in Bblfsh
- are used in the input, as an argument to the Parse() API requests [2]
- are used in the output, in protocol.v1 parseResponse.SupportedLanguages() [3]
- are persisted in each driver's manifest file [1]
- are used by `bblfshd` in every request to route it to a driverPool with the right driver image [5]
* either from user input
* or from calling enry
* in both cases, results are "normalized" to mitigate problem #1 [4]
- some clients (e.g. scala-client) perform the same "normalization"
even before sending the request to bblfshd [6]
Language names for the downstream users:
- usage pattern 1
send every request to bblfshd. This pattern is not affected by problems 1/2
- usage pattern 2
to reduce load on bblfshd when processing many files (e.g. gitbase, spark-connector)
* detect language of a file with enry,
* check if it is in the list of SupportedLanguages() for this bblfshd instance
* only if it is, send the request
this hits problem 1 hard, as right now language names are "normalized" only on input to bblfshd
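
To make the failure mode of usage pattern 2 concrete, here is a minimal sketch in Go. Everything in it is hypothetical: `normalize` and `shouldSend` are illustration-only helpers, not bblfshd or enry APIs, and the driver-side names ("csharp", "cpp", "bash") are assumed from the driver names mentioned above.

```go
package main

import "strings"

// normalize is a hypothetical stand-in for the "normalization" step:
// it maps a Linguist/enry name to the name a driver is assumed to use.
func normalize(lang string) string {
	l := strings.ToLower(strings.TrimSpace(lang))
	switch l {
	case "c#":
		return "csharp"
	case "c++":
		return "cpp"
	case "shell":
		return "bash"
	}
	return l
}

// shouldSend reports whether a file whose language was detected as
// `detected` is worth sending, given the names a bblfshd instance
// reported as supported. With normalizeFirst == false the detected
// name is compared as-is, which is what a client does today if it
// does not re-implement the normalization itself.
func shouldSend(detected string, supported []string, normalizeFirst bool) bool {
	name := detected
	if normalizeFirst {
		name = normalize(detected)
	}
	for _, s := range supported {
		if s == name {
			return true
		}
	}
	return false
}
```

With supported = ["csharp", "cpp", "bash", "python"], a file detected as "C#" is skipped unless the client normalizes the name first, even though a driver for it is installed.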
Proposal
- on problem 2: document the intended language/dialect support
Where would be the best place?
This is a bit different from aliases for the same language, as it should be generated from the manifest and include manually curated links to the original language feature specs
to clarify the level of support (e.g. cases/features that are not yet mapped to the Semantic UAST shape). Flow support in the JS driver is the best case at hand here.
- on problems 1 & 2: add a new array field to the driver manifest that lists all the supported language aliases/names,
e.g. for C#, language_aliases: ["c#", "csharp"], and retrofit the bblfshd input/output code paths to use it for finding the driver pool.
Backward compatibility
- a new version of bblfshd should be able to work with old versions of drivers:
if "language_aliases" is not present, use the language name, as before
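
A rough sketch of how the alias lookup with this fallback could work, in Go. The `Manifest` struct, `buildIndex` and `resolve` are hypothetical names for illustration and do not mirror the actual SDK manifest type or bblfshd code. One assumption beyond the text above: the driver's own language name is always indexed in addition to any aliases, so old drivers without "language_aliases" keep working unchanged.

```go
package main

import "strings"

// Manifest is a simplified, hypothetical view of a driver manifest.
type Manifest struct {
	Language string   // existing language name, e.g. "csharp"
	Aliases  []string // proposed "language_aliases"; nil for old drivers
}

// buildIndex maps every known name (lower-cased) to the driver's
// language name, which is what identifies the driver pool.
func buildIndex(manifests []Manifest) map[string]string {
	idx := make(map[string]string)
	for _, m := range manifests {
		// Fallback for old drivers: the language name alone.
		idx[strings.ToLower(m.Language)] = m.Language
		for _, a := range m.Aliases {
			idx[strings.ToLower(a)] = m.Language
		}
	}
	return idx
}

// resolve returns the driver language for a requested name, whether it
// came from user input or from enry. The second result is false when
// no installed driver claims that name.
func resolve(idx map[string]string, requested string) (string, bool) {
	lang, ok := idx[strings.ToLower(requested)]
	return lang, ok
}
```

The same index could serve both directions: routing on input, and reporting every alias in the list of supported languages on output, so that clients following usage pattern 2 would not need their own normalization.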
Rollout plan (does not need to be atomic)
- all drivers need to be released
- a new SDK + bblfshd version needs to be released
with the hard-coded user input "normalization" removed
- language-related documentation needs to be updated/re-generated
Side notes
- This task could be an ideal test bed for the upcoming release automation [7]
that is discussed in that issue, and could be a motivation to build a first version of it
- This task could also be a forcing function to remove v1 protocol leftovers from master and add
native support of the SupportedVersion request
Links: