The current syntax highlighting system is very slow, and there are noticeable lags when scrolling large C++ files which contain complex syntax elements.
Previously, most people suggested something like nvim-treesitter, which analyzes source code in a background treesitter process and renders keywords in the foreground with text properties.
But is it a good idea? I don't really think so; there are at least 4 disadvantages to treesitter solutions:
changetick increase to prevent such things, which is a little flaky.
Background syntax highlighter is still immature; there are still many other strange issues in nvim-treesitter:
https://github.com/nvim-treesitter/nvim-treesitter/issues
If we introduce something like this, we shall take all these issues into account.
Syntax highlighting is the most important part of an editor, better not rely on any uncontrollable external programs.
We need some new things that can satisfy such goals below:
And TextMate's grammar engine is really a good candidate which is widely used in many IDE/editors, including vscode (see syntax-highlight-guide for details), sublime and many others.
VS Code uses TextMate grammars as the syntax tokenization engine. Invented for the TextMate editor, they have been adopted by many other editors and IDEs due to the large number of language bundles created and maintained by the Open Source community.
TextMate grammars rely on Oniguruma regular expressions and are typically written as a plist or JSON. You can find a good introduction to TextMate grammars here, and you can take a look at existing TextMate grammars to learn more about how they work.
The grammar can be defined in JSON, which means it can be translated into VimL or kept as plain JSON files.
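To make the "structured collection of regular expressions" idea concrete, here is a minimal sketch of how such a rule table could tokenize one line. It is only an illustration under loud assumptions: it uses POSIX regex.h instead of Oniguruma, and the Rule/Span types, the demo_rules table, the scope names and tokenize_line() are all hypothetical names, not part of TextMate, VS Code or Vim.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_RULES 16

/* One grammar rule: a scope name plus a regex, loosely mirroring the
 * "name"/"match" pair of a TextMate pattern (hypothetical type). */
typedef struct {
    const char *scope;
    const char *pattern;   /* POSIX extended regex, NOT Oniguruma syntax */
} Rule;

typedef struct {
    size_t start;          /* byte offset of the match in the line */
    size_t len;            /* length of the match in bytes */
    const char *scope;     /* scope of the rule that matched */
} Span;

/* Hypothetical demo grammar. There are no word boundaries here, so "if"
 * would also match inside a longer identifier; good enough for a sketch. */
const Rule demo_rules[] = {
    { "comment.line",     "//.*" },
    { "string.quoted",    "\"[^\"]*\"" },
    { "constant.numeric", "[0-9]+" },
    { "keyword.control",  "return|if|else" },
};

/* Tokenize one line: repeatedly pick the rule whose match starts earliest
 * (the first rule wins ties), record it, and continue after the match.
 * Returns the number of spans written, or -1 if a pattern fails to compile. */
int tokenize_line(const Rule *rules, size_t nrules, const char *line,
                  Span *out, size_t max_out)
{
    regex_t compiled[MAX_RULES];
    size_t i, pos = 0, count = 0, len = strlen(line);

    assert(nrules <= MAX_RULES);
    for (i = 0; i < nrules; i++) {
        if (regcomp(&compiled[i], rules[i].pattern, REG_EXTENDED) != 0) {
            while (i > 0)
                regfree(&compiled[--i]);
            return -1;
        }
    }

    while (pos < len && count < max_out) {
        regmatch_t m, best_m = {0};
        size_t best = 0;
        int found = 0;

        for (i = 0; i < nrules; i++) {
            int flags = pos > 0 ? REG_NOTBOL : 0;
            if (regexec(&compiled[i], line + pos, 1, &m, flags) != 0)
                continue;
            if (m.rm_eo == m.rm_so)
                continue;              /* ignore empty matches */
            if (!found || m.rm_so < best_m.rm_so) {
                found = 1;
                best = i;
                best_m = m;
            }
        }
        if (!found)
            break;                     /* rest of the line is plain text */

        out[count].start = pos + (size_t)best_m.rm_so;
        out[count].len = (size_t)(best_m.rm_eo - best_m.rm_so);
        out[count].scope = rules[best].scope;
        count++;
        pos += (size_t)best_m.rm_eo;
    }

    for (i = 0; i < nrules; i++)
        regfree(&compiled[i]);
    return (int)count;
}
```

Calling tokenize_line(demo_rules, 4, "return 42; // done", spans, 8) yields three spans: the keyword, the number and the line comment. A real TextMate engine additionally handles begin/end rules, nested scopes and captures, which this sketch leaves out.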
We can specify which grammar engine to use for the given buffer:
And some new commands can be used to change the grammar engine:
:syntax grammar textmate
:syntax grammar default
:syntax load ~/.vim/syntax/cpp.json
For example, the snippet below can be included at the head of syntax files:
if has('textmate')
syntax grammar textmate
syntax load syntax/cpp.json
finish
endif
....
And lots of existing vscode/textmate syntax files can be reused with minimal modification.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub.
Triage notifications on the go with GitHub Mobile for iOS or Android.
Thank you for starting this discussion. I had a vague plan to look into integrating treesitter, it is good to know it also has disadvantages. Vscode is widely used, thus if it uses TextMate then there must be something good about it.
Comments welcome.
I'll just comment that I would take these comments about tree-sitter with a significant heap of salt.
There might be some misunderstanding here. Tree-sitter in neovim doesn't use an external process like coc.nvim. The parser runtime is a C library embedded into the editor itself (in total not more LOC than syntax.c + highlight.c in vim itself), and parses the buffer in memory and produces a syntax tree that in-process plugins can use (for highlighting but also for other purposes like text objects).
Right now the biggest problem with syntax highlighting is how inconsistent and unpredictable it is. A unified interface will be more than worth the effort.
TextMate will probably be better for keeping the syntax system more integrated & backwards compatible than using something like treesitter. Also the modular and overengineered plugin architecture of treesitter would be a huge departure from the way it is done right now, so we should be a little cautious about how much functionality to reimplement.
@bfredl How much longer does it take to load a larger file like src/evalfunc.c in Neovim when tree-sitter is enabled, compared to the default syntax highlighting? I'm assuming that default syntax highlighting is disabled for filetypes where tree-sitter is supported.
@bfrg src/evalfunc.c from vim (10 000 lines) takes 80 ms more time with tree-sitter enabled for the initial parse (200 ms compared to 120 ms in my config).
treesitter is more than just syntax highlighting, it's also useful for text objects for example.
The TextMate system is old. Sublime Text has been mentioned, but it dropped TextMate grammars years ago in favor of its own syntax engine. Does it make sense to adopt a system that is already waning? And how big is its library if it must be included?
Also, when saying that a system is more performant, some source/benchmark should be provided. Is TextMate more performant than treesitter? Who says so?
@bfredl, thanks for figuring it out. I made a new revision:
list of tree-sitter disadvantages:
nvim-treesitter needs to load an external shared library as the parser for each language. The shared library must be downloaded and compiled into .so files (I know :TSInstall can simplify these steps). The build can break if gcc/clang is not installed, and the plugin or neovim itself may break due to any of the common dynamic link library problems, e.g. version incompatibility when the plugin has been updated but the parser .so files have not, or a dependency conflict when loading the shared library.
The biggest risk is parser quality, with over 100 open issues for parsers:
examples for inconsistency:
example for performance:
The parser quality problem is totally out of control, nearly impossible for us to fix all the parsers one by one.
@mg979 the core part of textmate syntax system is oniguruma, which is open source and well maintained by the community.
known editors / ides using textmate syntax system:
Monarch was initially built to support languages in VS Code. Then, they decided to switch to TextMate as well, for the reasons listed here: microsoft/vscode#174 (comment).
Some details:
VS Code's tokenization engine is powered by TextMate grammars. TextMate grammars are a structured collection of regular expressions and are written as a plist (XML) or JSON files. VS Code extensions can contribute grammars through the grammar contribution point.
The TextMate tokenization engine runs in the same process as the renderer and tokens are updated as the user types. Tokens are used for syntax highlighting, but also to classify the source code into areas of comments, strings, regex.
Starting with release 1.43, VS Code also allows extensions to provide tokenization through a Semantic Token Provider. Semantic providers are typically implemented by language servers that have a deeper understanding of the source file and can resolve symbols in the context of the project. For example, a constant variable name can be rendered using constant highlighting throughout the project, not just at the place of its declaration.
Highlighting based on semantic tokens is considered an addition to the TextMate-based syntax highlighting. Semantic highlighting goes on top of the syntax highlighting. And as language servers can take a while to load and analyze a project, semantic token highlighting may appear after a short delay.
The tokenizer of vscode/textmate is:
And here is the wrapper in javascript, it's neatly written and not hard to understand:
All we need to do is rewrite the JavaScript wrapper in C,
And thousands of textmate syntax files are ready to use.
No more than 4854 lines (including comments) in javascript/typescript
Tests excluded, it's 3779 lines of code (source: cloc(1)).
Tests included, it's 5074 lines of code.
Why not Sublime grammar instead of TextMate grammar? It seems more powerful, and easier to read.
I think .sublime-syntax is easier to write and more readable.
Sublime text 3 has implemented a new grammar format that seems much better than the traditional textmate grammar.
Is it because there have been fewer .sublime-syntax files written than .tmLanguage ones? Is there a licensing issue with these files?
@lacygoill maybe the TextMate grammar is a little easier, because there are reference implementations:
But Sublime is closed source; would we need to write everything from scratch?
Good point. I forgot that sublime was closed source.
Is TextMate much better (readability, reliability, performance) than our current syntax highlighting mechanism?
Just for TypeScript alone, there have been 754 reported bugs, 41 remaining open currently.
Assuming we support TextMate, what would happen to our current issues related to syntax highlighting? Do we close them, and tell their authors to use the new syntax highlighting mechanism? If the users find issues in TextMate grammar files, do we accept their reports on this bug tracker? IOW, is it going to help reduce the number of remaining open issues here?
Because TypeScript is a new language that evolves quickly?
Oniguruma + a JSON-like config is certainly faster than Vim's current mechanism. People seldom encounter performance issues in syntax highlighting when using textmate/vscode/eclipse/jetbrains.
Sublime's grammar seems more readable and powerful than textmate, maybe oniguruma+config can achieve such thing.
I remember an issue where Vim was very slow when adding/removing text properties on CursorMoved. It only occurred while the syntax highlighting was enabled. So, one might think that the latter was the culprit. It turns out that the syntax highlighting was fine; the issue was Vim redrawing the screen too much.
With regard to how people perceive the current syntax highlighting as being too slow, I wonder which part of the issue comes from the syntax highlighting itself, and which part from something else (like too much redrawing).
People seldom encounter such issues in syntax highlighting when using textmate/sublime2/vscode/eclipse/jetbrains.
That's interesting. I hope it's really thanks to their own syntax highlighting mechanism, and not some other optimizations (like multithreading).
A couple of remarks:
after/syntax, would it be possible to do that with TextMate as well?
I think performance of Vim syntax highlighting could be improved before trying alternatives, for example:
I want to add that for tree-sitter we currently have none of the safeguards that are applied to regex-based highlighting, like limiting the line number, and no background parsing like Atom would do.
Background syntax highlighter is still immature
I think background syntax highlighting (if you refer to asynchronous or separate threads highlighting) is neither implemented for tree-sitter nor for traditional vim highlighting. The possibility to make a fast thread-safe copy of the parsing state for tree-sitter or any other kind of multithreading is not used at the moment in Neovim.
Many of the issues you cited complained about features missing due to missing :h syntax. It will always be difficult to transition from one syntax system to another, especially when it is as widely supported as Vim syntax/fold/indent files. Maybe it would be easier to maintain more compatibility with a system that works more similarly.
About the quality of the grammars, you surely have different trade-offs. VS Code has significantly more users than Atom and nightly Neovim. Tree-sitter parses the whole document, which can help with complex syntax constructs and large-scale structure. However, it gets confused more easily when it sees something that cannot be handled by the language grammar (preproc constructs or non-standard language extensions), while regexes with a more local view are often still OK. The error recovery capabilities depend a lot on how the concrete grammar is written. Tree-sitter provides something in between regex highlighting and LSP-like semantic highlighting, so it might not be necessary if the latter two are available for a language. Distributing binaries is another challenge for tree-sitter. Arbitrary code execution through custom scanners enables the highest flexibility, but may also pose a security risk if the parsers are not self-generated and the scanner code is not reviewed.
For those who haven't seen it, this is an excellent introduction to Tree-sitter, by the author: https://www.youtube.com/watch?v=Jes3bD6P0To&ab_channel=StrangeLoopConference
tl;dr: Tree-sitter is a (portable, dependency-free) C library which (conceptually) takes a grammar (expressed in JavaScript) and a source file, and returns a parse tree for the source file with respect to the grammar. The big selling point is that TS (claims that it) can handle syntax errors well (still return a reasonable parse tree) and that it is incremental (returns new parse trees efficiently/quickly given some code edits and previous trees).
Parsers for different languages are provided by the community and while I haven't seen this first-hand, I find it easy to believe that many of them are not great. But the project is much younger than TextMate, and GitHub uses it for its on-web syntax highlighting so there might be some corporate support there.
Personally, the thing I would be most excited about seeing is Vim exposing a representation of the syntax tree which can be used not just for syntax coloring but also for semantic editing (expand visual selection one AST node up, copy function body, etc.). IDK how well the Vim architecture supports this today. But in theory you could then plug in whatever parse-tree-generator you choose (Tree-sitter or TextMate).
If you are using an LSP language server, it's true that the LS can give you a parse tree (one which is even more accurate, esp. in the case of context-sensitive grammars like C++), but a language server (which Vim also doesn't natively support yet) will always be slower (it will do more than a parser, for example it will resolve cross-file deps and so on) and therefore will have to be async and higher-latency. So I think there is room for both a fast incremental parse system (like Tree-sitter) and LSP support (for things like go-to-definition and find usage).
See also this discussion in the VSCode repo: microsoft/vscode#50140
As someone who has spent months writing and maintaining TextMate and tree-sitter grammars for real-world languages, let me tell you that the TextMate grammar system is totally broken, at least from a 2021 perspective. TextMate grammars are a nightmare to maintain and impossible to get right. Out of desperation, I even developed my own macro system (just like the authors of TypeScript's TextMate grammar), and it was still a nightmare.
tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).
Betting on TextMate grammars in 2021 would be an engineering crime.
I am not sure how much of your hyperbolic speech can be deemed accurate, but from what I can see one of the biggest problem with tree-sitter is the general low quality of parsers contributed by different people as pointed out by the OP. "Top-notch" is not the way I would describe it. Which certainly needs to be taken into account as it would require a vast amount of effort to deal with these issues Vim would inherit as a result of undertaking the HUGE project of integrating tree-sitter.
I can't speak for the TextMate grammar for lack of familiarity. Personally my biggest problem with tree-sitter (at least the way neovim does it) is its dependency on the environment (gcc/clang), large binary size and the do-it-all mentality which suits neovim but definitely does not feel like the "vim way".
tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).
If tree-sitter is top-notch, how come a ubiquitous and highly popular language like Python has been broken in it for quite a while?
When I tested neovim 0.5.1 with tree-sitter I ended up having to disable TS for python (which is the language I use the most) because the indenting and highlighting were unusable. Doesn't exactly inspire confidence.
I think this discussion is devolving more and more from the purely technical and into prejudices. It is very important here to distinguish
tree-sitter (the engine, which I would agree with @fcurts is an excellent piece of software and fundamentally superior to other syntax engines);
I think Vim should at this stage focus on 1. to make a reasoned decision (while it of course makes good sense -- and would make me very happy -- to take Neovim's approach and decisions for 2. into account; admitting that the two projects have different needs).
And I find it highly disingenuous to point fingers at 3. while ignoring that the quality of TextMate grammars (and, indeed, Vim's bundled syntax files) varies wildly as well. It's clear that (just like Neovim) you cannot simply switch engines and have to support both (on a per-language basis) for some time until the replacement catches up.
I was obviously talking about the engine, which is what matters in the long run. Regarding existing grammars, the difference is that tree-sitter grammars can be improved relatively easily because they can be reasoned about. On the other hand, improving real-world TextMate grammars is anywhere from difficult to impossible. (Often, fixing one problem causes an inexplicable problem somewhere else, which is only discovered later.)
I can't comment on integration aspects. I'm not even a Vim user. But as a language/tooling developer myself, I feel strongly that it's time to move past TextMate grammars, which is why I offered my insights. Good luck!
If tree-sitter is top-notch, how come a ubiquitous and highly popular language like Python has been broken in it for quite a while?
When I tested neovim 0.5.1 with tree-sitter I ended up having to disable TS for python (which is the language I use the most) because the indenting and highlighting were unusable. Doesn't exactly inspire confidence.
@jgb Indentation has nothing to do with tree-sitter itself. There is a very ad-hoc implementation that uses the parsed tree as indentexpr. Python indentation is not working because this implementation only considers the syntax node you are currently on, which is nothing in the case of the Python parser, because the relevant syntax node ended on the previous line when you start a new one. One would have to add a rule that respects this case or tune the general logic at this point.
You always have to write some system that translates your parsed representation to indents. The quality of this translation says nothing about the quality of the representation itself.
As someone who recently spent some time writing a TreeSitter grammar, I have also become less enthusiastic about the project. I watched the author’s presentation a while ago and it sounded like the greatest invention since sliced bread, but in practice it doesn’t always work that well.
The biggest obstacle in my opinion is languages with preprocessors (e.g. C and C++). This isn’t something I had considered initially, but it is simply impossible to parse those languages with TreeSitter because you’re dealing with a language within a language. Now before someone mentions this: I know TreeSitter supports injections, e.g. JavaScript in HTML, but that’s not the same thing because, as I understand, each injection is essentially its own “program”. It’s fundamentally not possible to parse pre-processed languages with a context-free grammar. If you think about it, conditional compilation is as context-sensitive as it gets.
I’m talking about constructs like this:
#if FLAG
if (foo) {
#endif
    bar;
#if FLAG
}
#endif
Or this:
#define BEGIN_FUNC void () {
#define END_FUNC }

BEGIN_FUNC
    bla;
END_FUNC
Or this:
#define RENAME(x) renamed_ ## x

void RENAME(my_func) {
    bla;
}
How is TreeSitter supposed to generate an AST for such code if it doesn’t interpret the macros? It’s simply impossible. And often this will result in parse errors. Now, TreeSitter is in theory “fault tolerant”, so it should be able to recover from errors, but I’ve found that it often recovers in a weird, unpredictable way that causes syntax highlighting to be messed up. It gets even worse when we’re talking about using it for features like syntax-aware selections, indentations and folds: Just forget about it.
All TreeSitter grammars for preprocessed languages contain hacks to work around this issue, but they never work 100%. They just handle a few special cases, but blow up in the general case.
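For anyone who wants to try this out, here is a compilable variant that combines the token-pasting and conditional-compilation cases above. It is only a sketch: FLAG, the function body and the demo() helper are made up for illustration. The point is that the name renamed_my_func and the braces of the if block only exist after preprocessing, so a parser that reads the raw text without expanding macros never sees them as such.

```c
#include <assert.h>

/* Token pasting: RENAME(my_func) expands to the identifier renamed_my_func. */
#define RENAME(x) renamed_##x

/* Conditional compilation: the "if (foo) {" and "}" lines below are only
 * part of the program when FLAG is nonzero (hypothetical flag). */
#define FLAG 1

int RENAME(my_func)(int foo)
{
    int calls = 0;
#if FLAG
    if (foo) {
#endif
        calls++;
#if FLAG
    }
#endif
    return calls;
}

/* Only the expanded name exists for the compiler and the linker. */
void demo(void)
{
    assert(renamed_my_func(1) == 1);
    assert(renamed_my_func(0) == 0);
}
```

With FLAG set to 0 the same text compiles to a function that always increments calls, so the raw source has two different syntax trees depending on a value the parser does not know.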
The next problem is that parsing is incredibly slow. I benchmarked parsing a 4 MB file and it took over a second. Depending on where you are coming from, that might not sound too bad, but 4 MB a second really isn’t impressive when you consider that modern RAM can handle tens of gigabytes per second. Quite frankly, I’m not sure this “incremental parsing” approach is all that useful when the implementation is so slow in practice. I guarantee I could write a hand-rolled parser that would just reparse the entire file on every edit and it would still be orders of magnitude faster.
I’ve also found that syntactic highlighting doesn’t actually add that much value over a simple lexer, but it is significantly more complex. Semantic highlighting on the other hand is even more complex, but it also adds a lot of value. If I had to rate the cost-benefit relationship, I’d say: lexer > semantic > syntactic.
If I had to design a syntax highlighting system from scratch, I’d probably just go with a simple C API, something like this:
typedef enum {TOK_IDENT, TOK_STRING, TOK_OPERATOR, ...} Token;

void highlight_tokens(const char *buf, size_t len, Token *tokens,
                      const void *input_state, void *output_state,
                      size_t state_size);
You just pass a chunk of data to the parser and then it returns a buffer with a character class for each character (or maybe an array of ranges, see also LSP for a similar approach). This is the most general form, giving you the greatest amount of flexibility. You could hand-roll a parser, or build one based on regexes or TreeSitter grammars or whatever. It doesn’t restrict you to a particular system.
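A minimal sketch of what a hand-rolled implementation behind that signature might look like. Everything beyond the signature is an assumption for illustration: the extra token classes, the LexState struct, and the simplification that chunks always end at line boundaries (so a "/*" or a string is never split across two chunks).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token classes; TOK_OTHER, TOK_NUMBER and TOK_COMMENT are assumed extras. */
typedef enum {
    TOK_OTHER, TOK_IDENT, TOK_NUMBER, TOK_STRING, TOK_COMMENT, TOK_OPERATOR
} Token;

/* Hypothetical lexer state carried from one chunk to the next. */
typedef struct {
    int in_block_comment;
} LexState;

/* Writes one token class per input byte into tokens[0..len-1]. */
void highlight_tokens(const char *buf, size_t len, Token *tokens,
                      const void *input_state, void *output_state,
                      size_t state_size)
{
    LexState st = {0};
    size_t i = 0;

    assert(state_size == sizeof st);
    if (input_state != NULL)
        memcpy(&st, input_state, sizeof st);

    while (i < len) {
        unsigned char c = (unsigned char)buf[i];

        if (st.in_block_comment) {
            tokens[i] = TOK_COMMENT;
            if (c == '*' && i + 1 < len && buf[i + 1] == '/') {
                tokens[i + 1] = TOK_COMMENT;
                st.in_block_comment = 0;
                i += 2;
            } else {
                i++;
            }
        } else if (c == '/' && i + 1 < len && buf[i + 1] == '*') {
            tokens[i] = tokens[i + 1] = TOK_COMMENT;
            st.in_block_comment = 1;
            i += 2;
        } else if (c == '"') {
            tokens[i++] = TOK_STRING;              /* opening quote */
            while (i < len && buf[i] != '"') {
                if (buf[i] == '\\' && i + 1 < len)
                    tokens[i++] = TOK_STRING;      /* backslash of an escape */
                tokens[i++] = TOK_STRING;
            }
            if (i < len)
                tokens[i++] = TOK_STRING;          /* closing quote */
        } else if (isalpha(c) || c == '_') {
            while (i < len && (isalnum((unsigned char)buf[i]) || buf[i] == '_'))
                tokens[i++] = TOK_IDENT;
        } else if (isdigit(c)) {
            while (i < len && isdigit((unsigned char)buf[i]))
                tokens[i++] = TOK_NUMBER;
        } else if (c != '\0' && strchr("+-*/=<>!&|", c) != NULL) {
            tokens[i++] = TOK_OPERATOR;
        } else {
            tokens[i++] = TOK_OTHER;
        }
    }

    if (output_state != NULL)
        memcpy(output_state, &st, sizeof st);
}
```

The caller keeps one LexState per chunk boundary: the output_state of one call is passed as the input_state of the next, which is how a block comment opened in one chunk stays a comment in the following one.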
I’d even consider getting rid of the state persistence stuff and just pass one large buffer containing the entire file and reparse the whole file every time. Because in the general case, you have to do it anyway. Consider putting a comment /* at the beginning of a very large file. No matter what you do, sometimes, you’ll have to reparse everything, so I’m not sure it is even worth adding complexity to save time for only some edits. Better work on making the parser really fast. Computers are fast, it shouldn’t take that long to parse even a 100 MB file. And source files are usually much smaller than this.
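The /* case can be shown with a small sketch of the usual state-caching approach, under stated assumptions: end_state() is a toy scanner that only tracks block comments, and lex_all()/relex_from() are hypothetical helper names. After an edit, lines are re-lexed until a line's end state matches the cached one. A typical edit converges after one line, while an opening /* on the first line forces a re-lex of the whole file.

```c
#include <assert.h>
#include <stddef.h>

/* Toy scanner: returns 1 if the line ends inside a block comment, given
 * whether it started inside one. Strings and // comments are ignored. */
static int end_state(const char *line, int in_comment)
{
    for (size_t i = 0; line[i] != '\0'; i++) {
        if (in_comment && line[i] == '*' && line[i + 1] == '/') {
            in_comment = 0;
            i++;
        } else if (!in_comment && line[i] == '/' && line[i + 1] == '*') {
            in_comment = 1;
            i++;
        }
    }
    return in_comment;
}

/* Full parse: cache[i] receives the end-of-line state of line i. */
void lex_all(const char **lines, size_t n, int *cache)
{
    int st = 0;
    for (size_t i = 0; i < n; i++) {
        st = end_state(lines[i], st);
        cache[i] = st;
    }
}

/* After lines[edited] changed, re-lex from there and stop as soon as a
 * line's new end state equals the cached one, since all later lines are
 * then unaffected. Returns the number of lines that had to be re-lexed. */
size_t relex_from(const char **lines, size_t n, int *cache, size_t edited)
{
    size_t done = 0;
    int st;

    assert(edited < n);
    st = edited > 0 ? cache[edited - 1] : 0;
    for (size_t i = edited; i < n; i++) {
        st = end_state(lines[i], st);
        done++;
        if (st == cache[i])
            break;                 /* state converged, stop early */
        cache[i] = st;
    }
    return done;
}
```

So the cache only saves work for edits whose effect stays local; whether that is worth the extra complexity is exactly the trade-off described above.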
Anyone who eagerly promotes tree-sitter here should answer my questions above first. Repeating its advantages a thousand times does not mean that these fatal problems will disappear.
Tree-sitter is not a new thing, no need to be so excited. Remember that Atom adopted tree-sitter as early as 2018, and users in the Atom communities are very calm about this "new" feature.
I don't need a better highlighter at the cost of performance and flexibility, because I am suffering from performance issues right now and all I want is fast & static regex-based highlighting.
@lacygoill you claimed in this comment that the problem was caused by "drawing too much".
That's not true, I have done a bisect investigation in this problem here:
And found that there was a big performance regression after 8.0.643 and 8.0.647. You can simply compare the syntax highlighting speed of vim 7.4 and the latest vim 8.3.xxxx, and you will find that this is by no means a simple "drawing too much" problem.
@lacygoill you claimed in this comment that the problem was caused by "drawing too much".
That's not true,
It is. The patch that fixed my issue only reduced how often Vim was redrawing the screen:
diff --git a/src/textprop.c b/src/textprop.c
index b6cae70a8..e74c13849 100644
--- a/src/textprop.c
+++ b/src/textprop.c
@@ -809,6 +809,7 @@ f_prop_remove(typval_T *argvars, typval_T *rettv)
     int id = -1;
     int type_id = -1;
     int both;
+    int is_removed = FALSE;
     rettv->vval.v_number = 0;
     if (argvars[0].v_type != VAR_DICT || argvars[0].vval.v_dict == NULL)
@@ -889,6 +890,7 @@ f_prop_remove(typval_T *argvars, typval_T *rettv)
                 if (both ? textprop.tp_id == id && textprop.tp_type == type_id
                          : textprop.tp_id == id || textprop.tp_type == type_id)
                 {
+                    is_removed = TRUE;
                     if (!(buf->b_ml.ml_flags & ML_LINE_DIRTY))
                     {
                         char_u *newptr = alloc(buf->b_ml.ml_line_len);
@@ -920,7 +922,8 @@ f_prop_remove(typval_T *argvars, typval_T *rettv)
                 }
             }
         }
-    redraw_buf_later(buf, NOT_VALID);
+    if (is_removed)
+        redraw_buf_later(buf, NOT_VALID);
 }
As anyone can see, the patch did one thing, and one thing only: it put a condition on redraw_buf_later(); the latter can only be invoked if is_removed is true:
if (is_removed)
It did nothing else. And yet, it was enough to fix the issue.
I have done a bisect investigation in this problem here:
Syntax highlighting is extremely slow when scrolling up in recent version (v8.0.1599) #2712
This has nothing to do with my comment. It's an entirely different issue. The only way your comment might be relevant would be if I had written:
whenever Vim is slow, it's because it redraws the screen too much
But I did not say that. And the comment you link did not say that either.
I wrote that in my issue, the cause was too much redraw.
I did not write that in all issues, the cause was too much redraw.
Two last notes before I unsubscribe from this thread.
Asking for questions or clarifications is OK, but saying that I lie is not. I don't want to read anything from you anymore, so I've blocked you.
I don't care whether Vim integrates tree-sitter, TextMate, or whatever software is trending right now. All I care is how reliable Vim is.
@lacygoill, sad to hear that. I have been following you on GitHub for years, reading your posts in the issues, and studying your early vim9 plugin projects. What I meant was nothing more than "your speculation may be wrong". Complaining that I said you "lied" was a bit of an overreaction.
You just blocked a faithful follower.
I made the TextMate parser portable by removing the macOS Foundation code. It may be worth a test as a vim plugin. Making one is beyond my skillset.
https://github.com/icedman/tm-parser
This library works well on my editor projects, including an ncurses based editor. Works well enough with my Flutter app also. Ashlar Code app for Android (munchyapps.com)
Last year I wrote a plugin to highlight things with LPeg; I chose LPeg because I liked the way it works, vis already uses it, and there are a reasonable number of syntax files already available.
I got bored with it and never finished/published it; I think there were still some remaining issues, but I forgot what they are/were. I think most were related to using text properties to apply highlights, rather than LPeg itself, but not sure. Maybe I'll work on it some more and get it to at least a "publishable/experimental" state.
I also spent quite some time looking at tree-sitter; actually, that was what I originally wrote the plugin for, and came to the conclusion I don't care much for tree-sitter, or at least not for editors. One of the really great features of Vim's current syntax highlighting is that it's pretty easy to modify by users. Based on my experience answering questions on the Vi Stack Exchange people want to do this all the time: they want to highlight some keywords as errors; don't like how this or that is highlighted and want something different, they want to highlight their own project-specific things, etc. Tree-sitter makes that much harder, and I'd consider it a huge UX regression.
Even in "normal" usage there's an entire circus around managing it for end-users; you can't just "drop a file in ~/.vim/syntax/mylanguage.vim" or "~/.vim/after/syntax/mylanguage.vim", you need to compile shared objects with NodeJS and whatnot. The nvim-treesitter plugin manages all of that for you, but a plugin to manage all the circus is putting lipstick on an ugly pig IMO.
I also don't like the way tree-sitter syntax files are written in the first place; other people mentioned that many tree-sitter highlights aren't all that great, and that matches my experience too. My first instinct was "okay, so let's improve this!" but I found that quite hard and gave up after mucking about for a while with very limited success. I think that syntax being hard to write in tree-sitter is probably the reason so many syntaxes aren't so great in the first place. I certainly don't see how tree-sitter is "fundamentally superior to other syntax engines" as someone mentioned in this thread; this seems like some truism that keeps getting repeated, but I haven't seen any reasons why this should be the case (and I did try to find reasons).
Overall I do think the "tree-sitter approach" of more structured parsing is the better approach, I just don't think that tree-sitter is an especially great fit for Vim. I don't know why Neovim went with tree-sitter specifically: as near as I can determine it's just because someone wrote a patch for that – I couldn't really find any discussions about it. Interestingly Neovim does use LPeg internally for some things, I don't know if it was considered – or maybe it was, I very well may have missed some discussions somewhere.
I don't have any opinion on TextMate's system, as I didn't look at it, but when I started working on all of this and evaluating options I wrote down the following requirements:
1. Reasonably fast, even for large files, and it doesn't break.
2. Reasonably easy to modify, including by "normal" users such as sysadmins, scientists (in fields other than comp-sci), and just regular hobbyists who are not professional developers.
3. Readability and maintenance are important. Right now syntax files are a bit of a "write only, hopefully never read"-affair.
4. Easy to manage, it should "just work" after dropping a new file in your ~/.vim/ without muckery.
There are a million-and-one parser generators, tools, and so forth out there. It's literally people's entire career to research these kind of things and write tools for them.
Many of them fit requirement 1 ("fast and correct"), but most of them are not especially user-friendly. EBNF (and variants thereof) are more or less the standard for describing languages, but do you really want this as the basis for your syntax highlighting? Probably not.
This is actually a great feature of the current syntax system: you can add, remove, and modify things fairly easily. "I don't like this highlight" or "I want to add a new highlight for X" should be something a fairly experienced dev can do in under an hour. LPeg mostly retains this feature: you can still say "yo dawg, highlight this for me, kthxbye" or "eww, I don't like this, get rid of it!" and be done with it.
Without detailing all the solutions I looked at, I eventually settled on LPeg because, of all the solutions I found, I felt it had the best combination of correctness and UX.
I still think these are good requirements. It's quite possible there are existing tools out there that do a better job than LPeg, but IMHO tree-sitter very much doesn't.
—
I put my LPeg plugin over here: https://github.com/arp242/lpeg.vim
Like I said in my previous comment, I haven't worked on it for quite a while, but I did some spot-checking and it seems to work fairly decently. Much of it is stolen^H^H^H^H^H^H inspired by vis.
—
If anyone wants to have a look at a native implementation (= compiled, without runtime dependencies), I would look into bat: https://github.com/sharkdp/bat. It uses a native library that reads TextMate grammars, called syntect (https://crates.io/crates/syntect/1.7.1), which is probably good enough to try out in vim before writing a C implementation.
—
@theHamsta , sublime grammar is also a good choice:
—
tree-sitter's highlights have a lot of quality issues. Syntect, and therefore TextMate grammars, in Vim would be a game changer, at least for me. And since Rust has a very good FFI for C, I think it might be a feasible endeavor to integrate Syntect's lib with Vim's.
—
Apart from tree-sitter's poor parser quality, the fatal issue with tree-sitter's highlighting is portability.
If we want to encourage people to create diverse syntax highlighting, we must provide something simple, straightforward, and easy to learn for most users.
When we are using text-based grammar files (vim syntax/TextMate/Sublime syntax), it is very easy to make modifications or create a new one. For example, I can change cpp.vim to a new version that highlights some keywords/rules dedicated to my project, or that meets the latest C++ standard if the original author is too busy to update it.
By contrast, tree-sitter's syntax highlighting rules are hard-coded into the parsers. Even if you want to make a small change, you are required to change the parser yourself and build a new .so file for your target platform.
Changing a parser is much more complex than changing a text-based grammar file.
BTW: Tree-sitter is written in rust.
So far as I know, many vim users still don't have a gcc environment to build Vim themselves.
—
In case @skywind3000 decides to edit his post: he is boldly claiming that (1) Tree-sitter is written in Rust, (2) you have to write Rust code to create TS grammars, and (3) you cannot change TS highlighting at runtime. These claims are all very false. Since he has shown that he is willing to completely make things up to support his point, anything he says should be taken with a grain of salt.
—
@skywind3000 not all of them know Rust but since tree-sitter has a C core, wrapped in Rust (like Deno's V8) and because that gets built and delivered as a npmjs package, grammar authors do their thing in JavaScript.
But a TextMate-compatible parser like Syntect could probably be less of a hassle for the end user: just use those .json syntax files from VSCode/Sublime Text, and modifying them would mean just editing some .json.
As it stands today, if you dislike a tree-sitter highlight, changing it requires writing a subset of Scheme and/or tweaking tree-sitter using bindings in a language supported by your editor.
Even then, you'd also need to get knee-deep into the third-party provided tree-sitter grammar. So although it is easy to make them, editing them isn't as easy as extending a .json syntax file, like Sublime/TextMate grammars.
On the editor ecosystem side, having a TextMate grammar means much less work to port extensions from TextMate, VSCode and Sublime Text to Vim than with tree-sitter.
—
On the editor ecosystem side, having a TextMate grammar means much less work to port extensions from TextMate, VSCode and Sublime Text to Vim than with tree-sitter.
It's no skin off my nose either way, but just for the sake of completeness: going with tree-sitter would mean even less work porting from Neovim -- a "sister editor" that explicitly strives for Vim compatibility?
—
It's no skin off my nose either way, but just for the sake of completeness: going with tree-sitter would surely mean even less work porting from Neovim -- a "sister editor" that explicitly strives for Vim compatibility?
@clason, going with tree-sitter means, for now, choosing Neovim compatibility over Sublime Text, VSCode and TextMate.
Following conventions means that more features/innovations from other tools that follow those same conventions can be introduced into Vim with less work, as with the Language Server Protocol's conventions.
—
@clason, I admit that I was not aware of the parser generation part of tree-sitter; it was indeed my mistake to state that it is written in Rust.
A mistake is a mistake, I will not edit and revert my post.
But my core point still stands:
—
Maybe the first thing to do is a syntax highlighting interface, SHI, for vim. It could be set up such that if something adheres to the interface, it could be compiled with vim or added as a shared library; there could also be an LSP adapter for SHI. The interface could support async/concurrent operation.
It's been mentioned that there are additional uses for a true language parser, such as folding info. Is it reasonable or useful to have multiple SHIs active at one time? Internally vim could synthesize/merge the results from multiple sources.
Considering the heated interest in this topic, maybe Syntax/Highlighting Interface Tools, is better or more accurate.
—
I went ahead and made a Textmate plugin. It is currently for nvim though.
https://github.com/icedman/nvim-textmate
Coded in C/C++ and Lua; uses a modified version of Macromate's open-sourced TextMate app.
Nowhere near ready, but the speed already looks promising.
The syntax highlight output is similar to Treesitter. Treesitter has some other cool features. But it crawls when editing or even just opening large files. Example: the amalgamated sqlite3.c source, 200k lines.
—
textmate-based syntax highlighting for vim
https://github.com/icedman/vim-textmate
—
Does it mean that we can do everything through a .scm file without changing the parser ?
You can select every part of the parsing result and define relations between different CST nodes (e.g. select a function that has three arguments with the third starting with a vowel). You are a bit limited that you can only select nodes of the syntax tree, not individual characters directly (without custom functions).
Custom functions can be interpreted by the editor. These can also be used to select subranges of a node or to filter out results with custom logic. That's usually enough for syntax highlighting. In neovim's implementation, you can register the mentioned custom functions via Lua.
what if language standard evolves ? still no need to change the parser ??
Yes, changes to the language require updating, regenerating and compiling the parsers. Like TextMate grammars, the parser definitions are shared between editors. With the tree-sitter integration into Neovim the community got quite active, so typically new features get added quite quickly. In the case of nvim-treesitter, each plugin revision contains a lockfile with the parser revisions we have tested on CI to be compatible with the highlight queries (when new language features should get highlighted, they must be referenced in the *.scm files unless the parser author chose to reuse already present structures). The parsers get updated and compiled on the end user's side as soon as the feature has gone through our CI and been committed (rolling release). Other distribution strategies include managing the parsers via a plugin manager or via binary releases (a parser pack, or the regular release of Neovim, which includes the parser for the C language with more to be added).
The parsers usually use the terminology of the language specification and can reuse BNF language specs if available. So there is mostly no need for customization, as customization can be done via .scm files and the parser just follows official specs or existing parsers for the language. New parsers might have frequent updates in the beginning until they cover all features of a language, but at some point they are usually complete and only have a few commits a year. As with syntax files, it is definitely not necessary to always be on the latest revision.
—
Try editing this in neovim with treesitter on:
https://code.jquery.com/jquery-3.6.1.js
10,000+ lines of code
Try even scrolling through sqlite3.c in neovim with treesitter
200,000+ lines of code
Plain vim has no problem with these files. Granted, editing very large files is rare. But when something vim could previously do well is no longer possible, it should be considered a regression.
The title of the proposal is simply a better syntax highlighting.
Treesitter should be another proposal or something for the future, perhaps when vim becomes multithreaded.
—
Plain vim has no problem with these files.
Yes, because Vim has a parsing timeout and a limited parsing window, which tree-sitter in Neovim does not (yet). It's important to compare apples with apples here. Unqualified claims like
And textmate is the best answer.
do not help; at the very least I would have expected a benchmark here comparing (fairly!) the timings between regex highlighting, nvim-treesitter, and your textmate plugin for these files.
—
Adding a new engine is going to be something that needs to be done properly, it is a big investment.
I hope that if/when the time comes for working on this, it is thought of as
Adding an engine interface allowing different implementations to be used
As discussed in #9087 (comment)
—
Some information:
I was reading vscode's latest documentation and found that:
In the end, it seems that vscode didn't choose to integrate tree-sitter directly, but provided some APIs to allow extensions to provide new highlighting solutions:
Currently, vscode has two highlighting solutions:
Semantic highlighting is an addition to syntax highlighting as described in the Syntax Highlight guide. Visual Studio Code uses TextMate grammars as the main tokenization engine. TextMate grammars work on a single file as input and break it up based on lexical rules expressed in regular expressions.
Semantic tokenization allows language servers to provide additional token information based on the language server's knowledge on how to resolve symbols in the context of a project. Themes can opt in to use semantic tokens to improve and refine the syntax highlighting from grammars. The editor applies the highlighting from semantic tokens on top of the highlighting from grammars.
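To make the layering described above concrete, here is a minimal sketch of applying semantic spans on top of grammar spans for a single line. It is not VS Code's implementation; the `overlay` function, the span tuples and the highlight group names are all made up for illustration.

```python
from typing import List, Optional, Tuple

# (start column, end column exclusive, highlight group); a hypothetical format.
Span = Tuple[int, int, str]


def overlay(base: List[Span], semantic: List[Span]) -> List[Span]:
    """Return the spans for one line with semantic spans painted over base spans."""
    length = max((end for _, end, _ in base + semantic), default=0)
    groups: List[Optional[str]] = [None] * length

    # Paint grammar spans first, then semantic spans, so the later layer wins.
    for start, end, group in base + semantic:
        for col in range(start, end):
            groups[col] = group

    # Collapse the per-column groups back into contiguous spans.
    result: List[Span] = []
    for col, group in enumerate(groups):
        if group is None:
            continue
        if result and result[-1][2] == group and result[-1][1] == col:
            result[-1] = (result[-1][0], col + 1, group)
        else:
            result.append((col, col + 1, group))
    return result


if __name__ == "__main__":
    # "const foo": the grammar sees a keyword and an identifier; a (pretend)
    # language server knows that foo is a constant.
    grammar_spans = [(0, 5, "keyword"), (6, 9, "identifier")]
    semantic_spans = [(6, 9, "constant")]
    print(overlay(grammar_spans, semantic_spans))
```

The point of the layering is that the grammar layer is always available immediately, while the semantic layer only refines it where extra information exists.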
—
The textmate parser everyone keeps referring to relies on the oniguruma regex library which contains approximately 80k lines of code. Is this even an option to integrate it into Vim? Users will have to learn a new regex flavor just for writing syntax files. On the other hand, if Vim uses its own regex engine, all the existing textmate syntax files won't work, or will they?
I would like to see a comparison between Vim's syntax highlighting and textmate for a more complicated filetype, like C++, bash or similar. The author keeps suggesting textmate but hasn't shown anything (at least a screenshot comparison). Where does textmate shine exactly? And what exactly is easier to express in textmate's syntax files?
—
Yes, because Vim has a parsing timeout and a limited parsing window, which tree-sitter in Neovim does not (yet). It's important to compare apples with apples here. Unqualified claims like
Just run an eye test like I said. Try opening the said files. You don't need to make a benchmark.
Re: Textmate is the best answer (yes, that is probably biased). Let me change that: Textmate is the best immediate solution. Virtually everyone else uses it because the top IDEs use it: sublime text, atom, vscode, intellij (I think). Hence, it already has a wide language coverage.
—
This means the parser must be able to start at some point in the file.
It may look back for a point to synchronize, but always starting at the
top of the file isn't going to be sufficient.
Treesitter - from what I understand - and I used it a little only - always starts from the top of the buffer and requires access to the entire buffer. (correct me if I'm wrong).
You could make subsequent parses faster by telling it which parts of the buffer have changed before running the parse. Treesitter is from the Atom guys (I think). Atom runs the parser with its fast buffer snapshot feature and on a separate thread.
—
Virtually everyone else uses it because the top IDEs use it - sublime text, atom, vscode, intellij (i think).
I think virtually everyone is a bit of an exaggeration
Atom is dead: https://github.blog/2022-06-08-sunsetting-atom/
Do you have a source for intellij? Looking at https://plugins.jetbrains.com/docs/intellij/implementing-lexer.html and given that there is a textmate plugin: https://www.jetbrains.com/help/idea/textmate.html#import-textmate-bundles I suspect by default it doesn't use textmate
That would leave vscode and sublime text, when considering the more popular editors.
Looking at tree-sitter:
And although vscode doesn't use tree-sitter, it's still used at Github: https://github.blog/2021-12-09-introducing-stack-graphs/
This should at least give some confidence that tree-sitter is a) not dead, and b) the quality of the parsers will be improved, and vim joining the efforts could help
Let me change that - Textmate is the best immediate solution.
Best by what metric?
Regarding the performance:
It could be true that the initial parse with tree-sitter is slower (numbers?), but I think for a fair comparison one also needs to take re-parsing into consideration when making edits to a document. Given that people use editors to edit documents, that's kinda important.
And one of the goals of tree-sitter is that:
I think actual numbers would help this discussion a lot
—
I think actual numbers would help this discussion a lot
Yes, actual numbers would be much better. But then you'd have to code something first for vim.
So I went ahead and made a treesitter plugin.
https://github.com/icedman/vim-treesitter
Time is better spent coding than debating. I'm a lawyer by the way ;)
The plugin is highly experimental, and this implementation currently cheats: it doesn't parse the entire buffer, only what's visible, with some look-aheads and look-backs. This way it can jump and parse anywhere in the doc.
It can open sqlite3.c (200K lines) and jquery (20K lines) without a problem. It has some artifacts where portions of the parsed buffer result in errors.
This is also still very inefficient as it constantly re-parses the entire visible buffer. But the treesitter parse is indeed fast.
I will probably attempt parsing the entire document and updating the tree the way the library is supposed to be used.
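The "cheat" of parsing only the visible lines plus some margin can be sketched roughly like this. It is not the plugin's code: `WindowedParser`, `margin` and the `parse` callback are hypothetical names, and the sketch ignores edits (an edit inside the window would have to invalidate it).

```python
from typing import Callable, List, Optional, Tuple


class WindowedParser:
    """Parse only a window around the visible lines, and reuse it while scrolling."""

    def __init__(self, parse: Callable[[List[str]], object], margin: int = 500):
        self.parse = parse          # hypothetical: any function that parses a list of lines
        self.margin = margin        # look-back / look-ahead, in lines
        self.window: Optional[Tuple[int, int]] = None  # (start, end), end exclusive
        self.result: object = None

    def ensure(self, lines: List[str], top: int, bottom: int) -> bool:
        """Make sure lines top..bottom (inclusive) are covered; True if a re-parse ran."""
        if self.window and self.window[0] <= top and bottom < self.window[1]:
            return False  # still inside the previously parsed window
        start = max(0, top - self.margin)
        end = min(len(lines), bottom + 1 + self.margin)
        self.result = self.parse(lines[start:end])
        self.window = (start, end)
        return True


if __name__ == "__main__":
    buffer_lines = ["line %d" % i for i in range(200_000)]
    wp = WindowedParser(parse=len, margin=1000)  # len() stands in for a real parser
    print(wp.ensure(buffer_lines, 0, 40), wp.window)
    print(wp.ensure(buffer_lines, 20, 60), wp.window)
    print(wp.ensure(buffer_lines, 150_000, 150_040), wp.window)
```

Caching the window means small scroll movements need no re-parse at all, which addresses the "constantly re-parses the visible buffer" inefficiency, at the cost of the artifacts mentioned above when the window starts in the middle of a construct.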
—
When considering tree-sitter performance: the experience will be totally different depending on the parser (and the grammar rules that created it). Incremental parsing performance also depends drastically on the parser: some will invalidate the whole buffer on certain characters, while this will never happen for other grammars. As @icedman commented, even whole-buffer parsing on each keystroke is possible for most parsers (Neovim has no incremental parsing for injected languages yet, but helix does).
—
This is also still very inefficient as it constantly re-parses the entire visible buffer. But the treesitter parse is indeed fast.
I tried to build and set up your vim-treesitter plugin (though only built for ..._c), using vim/src/main.c as a test example:
1. Not sure whether it was more accurate (highlighting correctness) than vim's native syntax (maybe 'yes'?).
2. As for performance, I saw it hit the CPU very hard when I kept pressing j from top to bottom.
// my laptop may not be happy about that :-)
—
treesitter parsing is more accurate.
But it is slow, especially on huge C files.
So I cannot use treesitter for C code.
—
The scope-to-vim-highlight mapping in the vim-textmate plugin and the node-type-to-vim-highlight mapping in the vim-treesitter plugin are both very incomplete. Visually, you wouldn't appreciate the difference.
But you can run :TxmtDebugScopes and :TSDebugNodes to see what the parsers see.
I did the treesitter plugin not so much to test its speed, but to see whether a special mode ("cheat mode") for large documents is possible with treesitter (as mentioned by Bram). This is where parsing is not done on the entire document but only partially. It turns out this could be possible. Treesitter is very "fault tolerant" as advertised.
The error portions of the tree, can be handled by the native vim syntax highlighter.
—
But you can run :TxmtDebugScopes and :TSDebugNodes to see what the parsers see.
to textmate, maybe more interesting on its syntax accurate (hi correction) with its existed resource, i guess...
// vim native syntax fs sometimes was broken (specific ft), that would be a alternative way as 119 help........... :-)
—
I did the treesitter plugin not so much to test its speed. But to see whether a special mode - "cheat mode" for large documents is possible with treesitter (as mentioned by Bram). This is where parsing is not done on the entire document but only partially. It turns out this could be possible. Treesitter is very "fault tolerant" as advertised.
It could be done, though the architecture of tree-sitter was built precisely to avoid needing that, and to avoid the artifacts that this windowing has when it comes to paired syntax tokens like {/}, "/" and (/) for Lisp.
Tree-sitter allows you to set a deadline in microseconds for parsing and querying, so there's a guarantee that the function call takes no longer than the limit you set. The idea was to do synchronous parsing with a deadline set and then hand the parsing off to a background thread. Each parsing state is immutable (copy-on-write), so the asynchronous parse can race against another synchronous request. Continuing a parse that hit its deadline keeps the progress achieved so far. So for a really big file you would need to wait a bit for the first asynchronous syntax highlight, and then update using synchronous incremental parsing. Neovim does not use the deadline feature or asynchronous parsing yet. tree-sitter-c needs 240ms to parse itself (2.4MiB), tree-sitter-cpp needs 900ms to parse itself (9.8MiB). It should be possible to wait for that as long as it does not block your typing. Highlighting can still be done windowed (querying and setting highlights) but it can use the accurate AST of the whole file. You could even use both strategies and use the windowed parsing approach only until you wait for the whole file parsing to finish in the background.
If you edit the file you either have small changes that can be incrementally parsed or you paste 10MiB into the file that will trigger background parsing.
It is of course a question of preference on whether the windowed approach is already good enough for you.
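As a rough illustration of the "parse with a deadline, keep the progress, resume later" idea, here is a toy sketch. It does not use tree-sitter at all: `parse_steps` is a made-up stand-in parser that only tracks brace depth, and `DeadlineParse` is a hypothetical wrapper name.

```python
import time
from typing import Generator, List, Optional, Tuple


def parse_steps(lines: List[str]) -> Generator[None, None, List[Tuple[int, int]]]:
    """Toy resumable 'parser': records brace depth per line, pausing after each line."""
    depth = 0
    tree: List[Tuple[int, int]] = []
    for row, line in enumerate(lines):
        depth += line.count("{") - line.count("}")
        tree.append((row, depth))
        yield  # a point where the caller may stop and resume later
    return tree


class DeadlineParse:
    """Run a resumable parse in slices, each bounded by a time budget."""

    def __init__(self, lines: List[str]):
        self._steps = parse_steps(lines)
        self.tree: Optional[List[Tuple[int, int]]] = None

    def run(self, budget_s: float) -> bool:
        """Advance until finished or the budget is used up; True once complete."""
        deadline = time.monotonic() + budget_s
        while self.tree is None:
            try:
                next(self._steps)  # always makes at least one step of progress
            except StopIteration as stop:
                self.tree = stop.value
                break
            if time.monotonic() >= deadline:
                break  # out of time; progress so far is kept in the generator
        return self.tree is not None


if __name__ == "__main__":
    job = DeadlineParse(["void f() {", "  if (x) {", "  }", "}"] * 50_000)
    slices = 1
    while not job.run(budget_s=0.005):  # e.g. call once per idle tick
        slices += 1                     # the editor could handle input here
    print("finished in", slices, "slices; final depth", job.tree[-1][1])
```

In an editor, each `run` call would sit between input events, with windowed highlighting used until the whole-file result is available.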
—
Neovim does not use the deadline feature or asynchronous parsing yet.
So this is why it is very slow on large files. Even querying the tree (merely moving the cursor around) looks like slow blocking calls.
You could even use both strategies and use the windowed parsing approach only until you wait for the whole file parsing to finish in the background.
That is the idea. Native syn can augment the highlights. And using the complete AST when it becomes available or synced is a great idea.
I'm thinking of implementing the incremental updates. But it looks very cumbersome to do through a lua-and-C plugin.
—
Providing an interface shifts the problem elsewhere
(For discussion, say a scanner is regex and a parser is grammar; either can be the basis of a language plugin.)
I would think that part of developing the interface would be to
provide an implementation based on a winner: TreeSitter, TextMate or
some dark horse.
A benefit of an interface is that vim internally can focus on
integrating/using results and supporting optional async operation;
like managing the partial/window results as discussed by @icedman. A
complex language like c++ could have both scanner and parser
solutions. Scanner results are used until parser results are
available. In addition, a language plugin can provide info for
folding, indent, ...
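A minimal sketch of what "scanner results are used until parser results are available" could look like behind one interface. Everything here is hypothetical: the `Engine` protocol, `KeywordScanner`, `SlowParser` and `pick_highlights` are illustration names, not a proposed Vim API.

```python
from typing import Dict, List, Optional, Protocol, Sequence, Tuple

Span = Tuple[int, int, str]  # (start column, end column exclusive, highlight group)


class Engine(Protocol):
    """Hypothetical highlighting interface; None means 'no results yet'."""
    name: str

    def highlights(self, lines: List[str], top: int, bottom: int) -> Optional[Dict[int, List[Span]]]: ...


class KeywordScanner:
    """Cheap regex-like scanner: always has an answer immediately."""
    name = "scanner"

    def __init__(self, keywords: Sequence[str]):
        self.keywords = set(keywords)

    def highlights(self, lines, top, bottom):
        out: Dict[int, List[Span]] = {}
        for row in range(top, min(bottom + 1, len(lines))):
            spans, col = [], 0
            for word in lines[row].split(" "):
                if word in self.keywords:
                    spans.append((col, col + len(word), "Keyword"))
                col += len(word) + 1
            out[row] = spans
        return out


class SlowParser:
    """Stand-in for a real parser whose results arrive later (e.g. async)."""
    name = "parser"

    def __init__(self):
        self.ready = False

    def highlights(self, lines, top, bottom):
        if not self.ready:
            return None
        return {row: [(0, len(lines[row]), "Parsed")]
                for row in range(top, min(bottom + 1, len(lines)))}


def pick_highlights(engines: Sequence[Engine], lines: List[str], top: int, bottom: int):
    """Use the first engine (ordered most accurate first) that has results."""
    for engine in engines:
        result = engine.highlights(lines, top, bottom)
        if result is not None:
            return engine.name, result
    return None, {}


if __name__ == "__main__":
    parser, scanner = SlowParser(), KeywordScanner(["int", "return"])
    text = ["int x", "return x"]
    print(pick_highlights([parser, scanner], text, 0, 1))  # scanner answers first
    parser.ready = True
    print(pick_highlights([parser, scanner], text, 0, 1))  # parser takes over
```

With such an interface the editor only has to decide how to order and merge engines; which engine sits behind it becomes a separate decision.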
I'm assuming the current syntax files, VimSyn, still work; VimSyn is
the first implementation and supports many languages. This thread is
all about a 2nd implementation, formalizing how an implementation
interfaces with vim seems worth the effort. The extra effort might
even be small compared to the overall task.
someone who wants to support a certain language (without spending
too much time on it) creates extra decisions to be made
Isn't it sufficient to say do it this way first. With a brief
comment to look elsewhere for other more complex techniques. If the
"certain language" is simple then the default method probably works
well, if it's complex then the extra decisions are probably pretty
minor compared to the overall task; and probably worth taking the time
to consider.
In many/most cases a simple scanner solution, VimSyn, is good enough;
some cases are greatly enhanced by a parser solution (inherently more
complex and time consuming to implement). Only one implementation
seems a problem; either inaccurate results or too much complexity.
Additionally, there might be existing parser/scanners that aren't done
in the single integrated language plugin implementation or some new
general parser/scanner might emerge. If someone is willing to make it
available to vim, it's difficult, if not impossible, to do if there is
no vim interface.
If a substantial effort is going to be made in this area, locking in a
single 3rd party solution seems the wrong choice.
—
Here is another investigation into Textmate. This time in Ruby:
https://github.com/icedman/vim-textpow
This is very, very fast compared to the Lua-with-C version I made earlier. It uses a very old textmate implementation (github.com/grosser/textpow), which still requires some updates. But the code looks very maintainable at only 844 total lines, including the vim plugin, with the bonus that Ruby looks nice to the eye. Oniguruma is baked into Ruby 2.0. It is not that obscure after all.
The lua with C version:
https://github.com/icedman/vim-textmate
This is more complete and can render textmate themes. But it lags when scrolling too fast with pageup or pagedown, so I needed to employ some deferred highlighting cheats. It also needs build tools to compile the C module and is not as self-contained as the textpow version.
I guess textmate can live as a ruby or lua plugin until a new syntax highlighter is developed or the current one improved
—
Someone also pointed out to me this project
https://github.com/trishume/syntect
Claims to be a fast textmate parser highlighter. It is in Rust.
—
Yes, the premise of this report has always seemed a bit dubious to me. I was going to investigate further and actually run some tests but haven't done so. Thanks @icedman.
The only real advantage I see to supporting TextMate would be the corpus of grammar files. It has a couple of nice features that are missing in Vim like the ability to specify submatches limited to the start and end regions but these could be added.
I occasionally investigate specific TextMate grammars to see how someone else might have handled a tricky case I'm trying to solve in a Vim syntax file, but I can't recall seeing anything to suggest that the capabilities of that system are significantly better than what's currently available in Vim. Generally, I find that they don't have a solution or at best have implemented one similar to my own.
Even Microsoft, with all their resources, can't generate a C# grammar that I don't regularly find bugs in.
I have also experienced some truly horrific highlighting performance with C++ and TypeScript, some of which was bad enough to bring down VS Code, and recall finding plenty of highlighting performance related bug reports when I investigated.
Most of the other grammar options seem more expressive than TextMate and other systems like Tree-sitter offer something extra like the AST generation.
I'm aware this commentary is next to useless but i have 2c as well...
—
I wonder if we could convert a TextMate grammar into a Vim syntax file.
VimSyn is there, works and currently has ~300 languages. Just as an
interesting question is "can TextMate be translated to VimSyn?"; given
Vim's experience and what's out there, can a superior, even non
compatible, version of VimSyn be specified? and then current grammar
translated to it?
I continue to think that an interface is the way to go, rather than
picking this year's winner. I'd forgotten about LSP and thought it
didn't support syntax highlighting, but taking another look I saw a
bunch of stuff about work to add syntax highlighting API; I don't know
if that ever happened. This could be "the interface"?
I wonder if something like TextMate or TreeSitter could be front-ended
and/or made available through LSP.
Having the highlighting show up asynchronously, or become more
accurate after a delay, is considered bad, only to be used if there
is no other way. For the bulk of the files it should be
instantaneous. We have had quite a few users complain about
flickering
Does changing the color of a word cause flicker? If there's changes,
it means the first highlights were inaccurate. Would users choose to
have the display change to more accurate highlighting?; primarily
during startup of a new file. A scanner will never be as good as a
parser; I haven't used c++ for 20+ years, but I suspect some useful,
accurate highlight info (eg macros, templates, errors) would be
appreciated.
Of course, how performance is achieved is an implementation detail.
But having vim explicitly/API interoperate with capabilities like
"parse these areas of interest first" and "incremental parsing" and
"better results available for this area from the engine" might be
important. Does LSP handle these kinds of interaction?
Focusing on a fast internal syntax engine, optionally supplemented by
LSP when pinpoint accuracy is desirable. I guess managing two engines,
internal and LSP, at once is a big deal. So without VimSyn+LSP, if a
user chose a certain LSP, it might take a while for the initial syntax
highlights to show up. But with a good LSP front-ended implementation,
handling incremental parsing and windows of interest, after the
initial delay it should usually keep up.
I understand that LSP can go beyond syntax highlighting.
—
I wonder if we could convert a TextMate grammar into a Vim syntax file.
Most other new editors (new relative to Textmate the app) adopted the textmate format because of the "corpus of grammar files" available (https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations).
"corpus of grammar files" is not an insignificant advantage.
This leads me to agree that "converting textmate to vim syntax" is worth investigating. And it doesn't have to be fully compatible.
Treesitter converts json and js files into C modules. Using their grammar config files is also worth looking into.
—
PSA: LSP is a red herring here, it's irrelevant to the topic at hand.
LSP is about project-level "intelligence", meaning gathering and using cross-file information. While one of its newer features is "semantic highlighting" (which allows you to highlight, e.g., variables in one file differently if they are declared as const in another), this is not its main purpose, and the LSP interface is a very poor fit for general syntax highlighting. Here in fact the OP's point applies: language servers are external programs, and the communication overhead means there would be unacceptable latency.
(Think of it rather as providing an additional layer of more detailed highlights for some objects where additional information is available.)
—
I doubt it. AFAIK LSP gives you a fully parsed version of the code.
It depends on the Language Servers, but many Language Servers can accept half-formed source code and generate completion suggestions based on what is being typed.
—
Some testing with tree-sitter on native c
https://github.com/icedman/vim/tree/treesitter
https://github.com/icedman/vim/blob/treesitter/src/TREESITTER.md
using windowing mode (partial parse of only 2000 lines)
Notes:
The numbers don't account for rendering of highlights, only for tree parsing and updates.
Treesitter reads through the entire buffer even for single-line updates. It is still fast, but I'm not sure how efficient this is, as I used ml_get_buf.
In contrast - a textmate parser can be updated by feeding only the line edited and the parser state of the previous line. And then updates are done on the succeeding lines if necessary.
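The line-by-line update described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two-state toy grammar (code vs. `/* */` block comment), the `tokenize_line` function and the `LineHighlighter` class are all hypothetical and not taken from any TextMate implementation. The point is that each line is tokenized from the end state of the previous line, and re-tokenizing stops as soon as a line's end state matches the cached one.

```python
from typing import List, Tuple

CODE, COMMENT = "code", "comment"
Span = Tuple[int, int, str]  # (start column, end column, state name)


def tokenize_line(line: str, state: str) -> Tuple[List[Span], str]:
    """Tokenize one line given the state at its start; return spans and end state."""
    spans: List[Span] = []
    pos = 0
    while pos < len(line):
        if state == COMMENT:
            end = line.find("*/", pos)
            if end == -1:
                spans.append((pos, len(line), COMMENT))
                pos = len(line)
            else:
                spans.append((pos, end + 2, COMMENT))
                pos, state = end + 2, CODE
        else:
            start = line.find("/*", pos)
            if start == -1:
                spans.append((pos, len(line), CODE))
                pos = len(line)
                continue
            if start > pos:
                spans.append((pos, start, CODE))
            # Search for the closer after the opener so "/*/" is not misread.
            end = line.find("*/", start + 2)
            if end == -1:
                spans.append((start, len(line), COMMENT))
                pos, state = len(line), COMMENT
            else:
                spans.append((start, end + 2, COMMENT))
                pos = end + 2
    return spans, state


class LineHighlighter:
    """Caches the end state of every line so edits only re-tokenize what changed."""

    def __init__(self, lines: List[str]) -> None:
        self.lines = list(lines)
        self.spans: List[List[Span]] = []
        self.end_states: List[str] = []
        state = CODE
        for line in self.lines:
            spans, state = tokenize_line(line, state)
            self.spans.append(spans)
            self.end_states.append(state)

    def edit_line(self, index: int, new_text: str) -> int:
        """Replace one line and return how many lines had to be re-tokenized."""
        self.lines[index] = new_text
        state = self.end_states[index - 1] if index > 0 else CODE
        retokenized = 0
        for i in range(index, len(self.lines)):
            spans, state = tokenize_line(self.lines[i], state)
            retokenized += 1
            unchanged = state == self.end_states[i]
            self.spans[i] = spans
            self.end_states[i] = state
            if unchanged:
                break  # following lines start in the same state as before
        return retokenized
```

Opening a block comment on the first line forces the following lines to be redone until the comment closes, while an edit that leaves the end state unchanged touches only the edited line.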
Tree-sitter can do an initial parse of 220K+ lines in 1.1 seconds, which is fast. In contrast, I did a test on textmate some time ago: parsing through 220K+ lines in a single go required 15 seconds.
Windowing mode is fast. Whether 2000 lines of partial parse is acceptable is another question. For full treesitter parse -
I think a problem that needs to be solved is querying very large tree and translating it to highlights, indents or whatever else can be used of it. I skimmed through nvim-treesitter, it looks like it utilizes a lot of caching.
—
tree-sitter clone in javascript
https://lezer.codemirror.net/
possible in javascript = possible in vimscript?
—
I skimmed through nvim-treesitter, it looks like it utilizes a lot of caching.
None of the highlighting logic is in nvim-treesitter. Upstream nvim only queries the updated ranges as reported by the parser after an incremental parse.
—
Late to the discussion here, but when people talk about the wealth of syntax highlighting files in TextMate format (due to VSCode's popularity), can people be more specific? Vim also supports 300+ file formats and is a popular text editor in its own right, so I'm curious whether there are specific examples of a decently popular format that has awesome syntax highlighting elsewhere but none in Vim. And is that only because of a lack of interest in maintaining a Vim-specific syntax file, or a genuine technical roadblock?
Looking through the thread, I'm still not sure what the supporting argument for pursuing TextMate's syntax format is, other than "it's what the other guys are using". That's a poor argument, especially when the people using it (e.g. VSCode) also have a giant thread discussing adopting tree-sitter. The supposed gains seem a little hand-wavy and quite minor, as both Vim's and TextMate's systems are fundamentally regex engines and the TextMate system is quite old by now. Adding a new system like this is a huge endeavor, so picking something that could at best be a little better seems like a bad idea to me considering its cost (both in implementation and maintenance).
For tree-sitter, it does look like a more ideal solution, but I'm still a little concerned about the lack of true semantic highlighting (e.g. C++ macros as mentioned above), and whether that's just adopting a not-quite-there solution. I'm also not exactly sure how people are supposed to distribute third-party tree-sitter plugins. Binary releases seem like a big regression from human-readable .vim files (the alternative to binary release is distributing tree-sitter source files that people need to compile themselves which is also not great, unless Vim bundles a compiler…).
—
I'm also not exactly sure how people are supposed to distribute third-party tree-sitter plugins.
I've been poking around treesitter. It could be possible to define a grammar from a vim file. Looking at the generated parser for C, it contains several parsing tables which could be imported from a file. It also has a lexer function which looks like it could be converted into a table as well.
From what I understand, treesitter allows defining your own scanner, as in the parser for CPP: a scanner coded by hand rather than generated. It may also be possible, though, to allow callbacks from vimscript to implement the scanner.
—
And is that only because of lack of interest in maintaining a Vim-specific syntax file
Could be.
That is why it could be worth investigating if converting textmate grammars is possible.
It could also be worth investigating how far off Vim syntax is from TextMate highlights. Maybe a tweak or a minor feature add-on could improve it greatly.
I've been poking around TextMate too, so far as to implement it in C :) github.com/icedman/tiny-textmate. It is small enough that it could live in a browser: github.com/icedman/wasm-tiny-textmate
In terms of speed when parsing a whole document, there is no way TextMate can compete with treesitter (even if someone vastly improves my code). There is probably also no way it could compete with Vim's syntax engine well enough to justify replacing it.
—
That is why it could be worth investigating if converting textmate grammars is possible.
Right. I feel like that should be the first thing to try before we start incorporating the entire TextMate system into Vim. But step 0 should really be finding concrete examples first (as I mentioned). Maybe something like TypeScript, which is a Microsoft-developed technology (since Microsoft also makes VSCode)? It's useful to at least find out what the "worst case" situation is to begin with.
Back to tree-sitter. I'm not sure if it's necessarily the best technology for this (I still have concerns about the project's design, which relies on pre-compiled binaries and so makes it hard to distribute plugins), but I do think something like it (which provides context-free grammar support) is beneficial and the obvious next step for providing more accurate syntax highlighting and/or code understanding (e.g. you can ask the editor to give you the scope of a function with this).
I guess my annoyance with the discussion here is still that there seems to be a lot of hearsay, comparison by analogy, deference to authority ("the other editors are using this"), and focus on implementation specifics, instead of a more fundamental, principled discussion of what kind of properties we want from a syntax highlighting engine. TextMate and tree-sitter are quite different technologies, so whether we want to adopt each should be a separate discussion rather than a "just pick a syntax engine" one (since Vim already has one). It would be easier if we could establish 1) what problems exactly we are trying to solve here, and 2) what properties we need.
To me, some of the existing problems with Vim's syntax highlighting are:
syn-sync. This is tied to the performance issue since we can't always start parsing from the beginning of the file.
Some properties I think we would like (note that not all of them are always solvable, and some properties could work against each other):
Anyway I digress, just my 2c.
—
"editor uses this x x x" .. this should not be understated.
It's hard enough to make a new syntax engine. It's harder, or at least exponentially more work, to create new grammars, and this relies on individuals creating a grammar for their favorite language.
There's a lot of work already done in textmate. We would be wise to at least see what can be reused to help improve the existing engine.
Treesitter is the future (my humble opinion). Nvim-treesitter will most probably improve eventually.
It's also a good idea to join in the effort there.
—
Is this what one has to type to specify this?
https://github.com/tree-sitter/tree-sitter-typescript/blob/master/common/corpus/declarations.txt
No, this is a test file: TypeScript source text and the expected parsing result alternate (separated by ------). The grammar specification is always called grammar.js.
The TypeScript parser is confusing since it contains two parser definitions which share a common JS file, https://github.com/tree-sitter/tree-sitter-typescript/blob/master/common/define-grammar.js (called from tsx/grammar.js and typescript/grammar.js). The grammar definition has a dialect argument which makes distinctions between tsx and typescript possible. For all other languages, the grammar definition is the grammar.js in the root directory (e.g. https://github.com/tree-sitter/tree-sitter-cpp/blob/master/grammar.js).
It would be possible to compile into some kind of byte code,
but it must be a simple compiler, otherwise it gets too big. We already
have the regexp program "compiler" and the Vim9 :def function compiler,
something like that could work.
There were also complaints about the Node JS requirement for tree-sitter-cli. tree-sitter-cli is a kind of compiler: it has a front end which currently invokes Node JS https://github.com/tree-sitter/tree-sitter/blob/3563fe009aa3cf373ae01782979743e6aa258a0a/cli/src/generate/mod.rs#L171-L192. The output of the front end is src/grammar.json https://github.com/tree-sitter/tree-sitter/blob/3563fe009aa3cf373ae01782979743e6aa258a0a/cli/src/generate/dsl.js#L418 which contains a full description of the grammar. The back end generates C code which is dynamically loadable by the tree-sitter runtime https://github.com/tree-sitter/tree-sitter/tree/master/lib/src.
It is certainly possible to move the compilation to runtime or first load time. All the runtime expects is a TSLanguage https://github.com/tree-sitter/tree-sitter/blob/master/lib/include/tree_sitter/parser.h#L90-L127. See the last lines of src/parser.c for the TSLanguage returned by a generated file. If Vim were to write such a compiler, it could also be used as an alternative loading mechanism in Neovim and Helix.
When Neovim or Helix want to load a language, e.g. the language foo, they search for a symbol called tree_sitter_foo. They then call tree_sitter_foo(), which returns the TSLanguage the runtime expects. If Vim or upstream tree-sitter can provide a runtime compiler for a new grammar description language, they could also load a TSLanguage by calling grammar_runtime_compiler("path/to/grammar/definition.file"), with no changes to the tree-sitter runtime required. The initial revision of "path/to/grammar/definition.file", which can be in a human-editable format, could be generated by a modified tree-sitter-cli with a changed compiler back end, or popular languages could be ported manually.
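The symbol lookup described above can be sketched with Python's ctypes module. This is a minimal illustration, not how Neovim or Helix implement it: the helper names `language_symbol` and `load_language` and any library path passed in are hypothetical, and the returned value is treated as an opaque pointer to a TSLanguage rather than a wrapped struct.

```python
import ctypes


def language_symbol(name: str) -> str:
    """Name of the entry point a compiled grammar exports, e.g. tree_sitter_foo."""
    return "tree_sitter_" + name


def load_language(library_path: str, name: str) -> int:
    """Load a compiled grammar and return the address of its TSLanguage.

    library_path is a hypothetical path to a shared library built from a
    generated src/parser.c; the result is only an opaque pointer value.
    """
    library = ctypes.CDLL(library_path)  # raises OSError if the file cannot be loaded
    entry_point = getattr(library, language_symbol(name))  # AttributeError if missing
    entry_point.argtypes = []
    entry_point.restype = ctypes.c_void_p
    pointer = entry_point()
    if not pointer:
        raise RuntimeError(f"{language_symbol(name)}() returned NULL")
    return pointer
```

A runtime compiler as proposed here would replace only the body of `load_language`: anything that ends up producing a valid TSLanguage pointer would do.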
—
I was more thinking of distributing compiled byte code. It would still
need to be in the form of a language description, rather than code that
is executed (to avoid trojan horses). If the byte code is more or less
readable, or turned into something readable with a Vim command, that
would be a big plus.
Tree-sitter uses WASM to run on the web. There were ideas for Neovim to run a WASM runtime as a plugin host.
The tree-sitter-cli is actually a nice starter for understanding how it
works: https://github.com/tree-sitter/tree-sitter-cli
tree-sitter-cli now lives here: https://github.com/tree-sitter/tree-sitter/tree/master/cli. Tree-sitter-cli is an ahead-of-time compiler and is not needed by an end user when the compilation result is distributed (WASM, binaries, C code, or something new tailored to the needs of Vim). I was arguing that tree-sitter-cli is just one compiler implementation (Node JS front end, Rust back end) and that an alternative one could be written with a different grammar DSL. Editors typically ship only the tree-sitter runtime (a C library without dependencies). Any function that can return a TSLanguage struct should work with the C runtime.
I was more thinking of distributing compiled byte code. It would still
need to be in the form of a language description, rather than code that
is executed (to avoid trojan horses)
src/grammar.json contains all the information needed. It can surely be represented in a more compact binary representation. Arbitrary code execution is only used for scanners. For C++, the scanner is only needed when some state needs to be stored, which is only the case for raw strings (https://www.geeksforgeeks.org/raw-string-literal-c/): for R"delimiter( raw_characters )delimiter" it needs to store the delimiter it parsed in order to determine the appropriate closing token.
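To make the stored-state point concrete, here is a small sketch of a raw-string scanner. It is not the tree-sitter-cpp scanner (which is written in C); `scan_raw_string` is a hypothetical helper, and it skips validation such as the length and character restrictions C++ places on the delimiter.

```python
from typing import Optional, Tuple


def scan_raw_string(text: str, start: int = 0) -> Optional[Tuple[str, str, int]]:
    """Scan a C++ raw string literal beginning at text[start].

    Returns (delimiter, contents, end_index), or None if no complete raw
    string starts at that position.
    """
    if not text.startswith('R"', start):
        return None
    open_paren = text.find("(", start + 2)
    if open_paren == -1:
        return None
    # This is the state the scanner has to remember: a fixed pattern cannot
    # know in advance which closing token to look for.
    delimiter = text[start + 2:open_paren]
    closing = ")" + delimiter + '"'
    close_at = text.find(closing, open_paren + 1)
    if close_at == -1:
        return None
    return delimiter, text[open_paren + 1:close_at], close_at + len(closing)
```

For example, in `R"xy(a)" b)xy"` the first `)"` does not end the literal because the remembered delimiter is `xy`; only `)xy"` does.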
A C++ parser is likely to be one of the most complex ones, not good as
an example.
Sorry for that. C++ is also a bad example since most of the code lives in tree-sitter-c and the rules are only extended in tree-sitter-cpp.
—
A C++ parser is likely to be one of the most complex ones, not good as an example.
Try the JSON one: https://github.com/tree-sitter/tree-sitter-json/blob/master/grammar.js
But it’s worth noting that Tree-Sitter requires a custom lexer (implemented in C/C++) for many languages such as Python that cannot be parsed by a context-free grammar alone. So distributing grammars as byte-code might be challenging. Even if it were possible, that would make syntax highlighting even slower, and Tree-Sitter is already quite slow in my experience.
I actually did a performance comparison of several editors a few weeks ago, but unfortunately still haven’t gotten round to publishing the results.
—
Re treesitter: user-end compilation is so much of a hurdle that it eliminates itself as a possible replacement for syntax highlighting.
I would suggest opening a new issue for it to explore it further.
—
Vim syntax can be improved with some of the features from textmate:
.. I think #1 is not too much work and would by itself greatly enhance the existing engine
—
Is dynamic end matching different from :help :syn-ext-match?
—
A next step would be if we can make a converter from a TextMate grammar
to Vim syntax rules. So we can see if it could work.
I did try generating a vim syntax file from a textmate grammar file. So far, I could only make simple keyword matches work. The regex engine would complain: "too many parenthesis", and something like "you cannot use this pattern recursively"
Can you provide more details, so that we can have an idea of how
complicated this would be?
Will do.
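For reference, the simple keyword case mentioned above might look roughly like the sketch below. It is not the converter used in that experiment: `convert_keywords`, the `SCOPE_TO_VIM` mapping and the generated group names are all assumptions made for illustration. It only handles top-level rules whose `match` is a plain `\b(word|word|...)\b` alternation and skips everything else (begin/end rules, the repository, includes, captures), which is where the regex-dialect differences start to bite.

```python
import json
import re
from typing import List

# Hypothetical mapping from the first TextMate scope component to a standard
# Vim highlight group; a real converter would need a much fuller table.
SCOPE_TO_VIM = {"keyword": "Keyword", "storage": "Type", "constant": "Constant"}

# Accepts only patterns of the exact form \b(word1|word2|...)\b
KEYWORD_PATTERN = re.compile(r"^\\b\((\w+(?:\|\w+)*)\)\\b$")


def convert_keywords(grammar_json: str, prefix: str) -> List[str]:
    """Emit Vim syntax commands for the plain keyword rules of a TextMate grammar."""
    grammar = json.loads(grammar_json)
    commands: List[str] = []
    for index, rule in enumerate(grammar.get("patterns", [])):
        match = KEYWORD_PATTERN.match(rule.get("match", ""))
        target = SCOPE_TO_VIM.get(rule.get("name", "").split(".")[0])
        if match is None or target is None:
            continue  # anything beyond a plain word list is left unconverted
        group = f"{prefix}Keyword{index}"
        words = " ".join(match.group(1).split("|"))
        commands.append(f"syntax keyword {group} {words}")
        commands.append(f"highlight default link {group} {target}")
    return commands
```

Given a toy grammar with one `keyword.control` rule matching `\b(if|else|while)\b`, this yields a `syntax keyword` line listing the three words plus a `highlight default link` line pointing the group at Keyword.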
—