The current syntax highlighting system is very slow, and there are noticeable lags when scrolling large C++ files which contain complex syntax elements.
Previously, most people have suggested something like nvim-treesitter, which analyzes source code in a background treesitter process and renders keywords in the foreground with text properties.
But is it a good idea? I don't really think so; there are at least 4 disadvantages to treesitter solutions:
changetick increase to prevent such things, which is a little flaky.
Background syntax highlighter is still immature; there are still many other strange issues in nvim-treesitter:
https://github.com/nvim-treesitter/nvim-treesitter/issues
If we introduce something like this, we will have to take all these issues into account.
Syntax highlighting is the most important part of an editor, so it is better not to rely on any uncontrollable external programs.
We need something new that can satisfy the goals below:
TextMate's grammar engine is a really good candidate. It is widely used in many IDEs and editors, including VS Code (see syntax-highlight-guide for details), Sublime and many others.
VS Code uses TextMate grammars as the syntax tokenization engine. Invented for the TextMate editor, they have been adopted by many other editors and IDEs due to the large number of language bundles created and maintained by the Open Source community.
TextMate grammars rely on Oniguruma regular expressions and are typically written as a plist or JSON. You can find a good introduction to TextMate grammars here, and you can take a look at existing TextMate grammars to learn more about how they work.
The grammar can be defined in JSON, which means it can be translated into Vim script or kept as plain JSON files.
We can specify which grammar engine to use for the given buffer:
And some new commands could be used to change the grammar engine:
:syntax grammar textmate
:syntax grammar default
:syntax load ~/.vim/syntax/cpp.json
For example, the snippet below could be included at the top of syntax files:
if has('textmate')
  syntax grammar textmate
  syntax load syntax/cpp.json
  finish
endif
....
And lots of existing vscode/textmate syntax files can be reused with minimal modification.
Thank you for starting this discussion. I had a vague plan to look into integrating treesitter, so it is good to know it also has disadvantages. VS Code is widely used, so if it uses TextMate there must be something good about it.
Comments welcome.
I'll just comment that I would take these comments about tree-sitter with a significant heap of salt.
There might be some misunderstanding here. Tree-sitter in Neovim doesn't use an external process like coc.nvim. The parser runtime is a C library embedded into the editor itself (in total not more LOC than syntax.c + highlight.c in Vim itself); it parses the buffer in memory and produces a syntax tree that in-process plugins can use (for highlighting, but also for other purposes like text objects).
Right now the biggest problem with syntax highlighting is how inconsistent and unpredictable it is. A unified interface will be more than worth the effort.
TextMate will probably be better for keeping the syntax system more integrated & backwards compatible than using something like treesitter. Also the modular and overengineered plugin architecture of treesitter would be a huge departure from the way it is done right now, so we should be a little cautious about how much functionality to reimplement.
@bfredl How much longer does it take to load a larger file like src/evalfunc.c in Neovim when tree-sitter is enabled, compared to the default syntax highlighting? I'm assuming that default syntax highlighting is disabled for filetypes where tree-sitter is supported.
@bfrg src/evalfunc.c from Vim (10,000 lines) takes 80 ms longer with tree-sitter enabled for the initial parse (200 ms compared to 120 ms in my config).
treesitter is more than just syntax highlighting, it's also useful for text objects for example.
The TextMate system is old. Sublime Text has been mentioned, but it abandoned TextMate grammars years ago in favor of its own syntax engine. Does it make sense to adopt a system that is already waning? And how big is its library if it must be included?
Also, when claiming that a system is more performant, some source or benchmark should be provided. Is TextMate more performant than treesitter? Who says so?
@bfredl, thanks for pointing that out. I made a new revision:
list of tree-sitter disadvantages:
nvim-treesitter
need to load an external shared library as the parser for each language; the shared library must be downloaded and compiled into .so files (I know :TSInstall can simplify these steps); the build process can break if gcc/clang is not installed; the plugin or Neovim itself may break due to any of the common dynamic link library problems, e.g. a version incompatibility when the plugin has been updated but the parser .so files have not, or a dependency conflict when loading the shared library.
The biggest risk is parser quality, with over 100 open issues for parsers:
examples for inconsistency:
example for performance:
The parser quality problem is totally out of our control; it is nearly impossible for us to fix all the parsers one by one.
@mg979 the core of the TextMate syntax system is Oniguruma, which is open source and well maintained by the community.
known editors / ides using textmate syntax system:
Monarch was initially built to support languages in VS Code. Then they decided to switch to TextMate as well, for the reasons listed here: microsoft/vscode#174 (comment).
Some details:
VS Code's tokenization engine is powered by TextMate grammars. TextMate grammars are a structured collection of regular expressions and are written as a plist (XML) or JSON files. VS Code extensions can contribute grammars through the grammar contribution point.
The TextMate tokenization engine runs in the same process as the renderer and tokens are updated as the user types. Tokens are used for syntax highlighting, but also to classify the source code into areas of comments, strings, regex.
Starting with release 1.43, VS Code also allows extensions to provide tokenization through a Semantic Token Provider. Semantic providers are typically implemented by language servers that have a deeper understanding of the source file and can resolve symbols in the context of the project. For example, a constant variable name can be rendered using constant highlighting throughout the project, not just at the place of its declaration.
Highlighting based on semantic tokens is considered an addition to the TextMate-based syntax highlighting. Semantic highlighting goes on top of the syntax highlighting. And as language servers can take a while to load and analyze a project, semantic token highlighting may appear after a short delay.
The tokenizer of vscode/textmate is:
And here is the wrapper in javascript, it's neatly written and not hard to understand:
All we need to do is rewrite the JavaScript wrapper in C, and thousands of TextMate syntax files are ready to use.
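To make the "structured collection of regular expressions" idea concrete, here is a minimal sketch in C of a rule-list tokenizer. It is not the vscode-textmate algorithm: it assumes POSIX `regex.h` as a stand-in for Oniguruma, and the `Rule`, `Span` and `tokenize_line` names are made up for illustration. At each position it picks the rule whose match starts earliest, with ties going to the rule listed first, and it ignores nested scopes and begin/end rules entirely.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_RULES 16

/* Hypothetical rule: a scope name plus a POSIX extended regex
 * (a real TextMate grammar would use Oniguruma patterns). */
typedef struct {
    const char *scope;
    const char *pattern;
} Rule;

/* One highlighted span: [start, end) byte offsets into the line. */
typedef struct {
    size_t start;
    size_t end;
    const char *scope;
} Span;

/* Tokenize a single line. Returns the number of spans written to out,
 * or -1 if there are too many rules or a pattern fails to compile. */
int tokenize_line(const char *line, const Rule *rules, size_t nrules,
                  Span *out, size_t max_out)
{
    regex_t compiled[MAX_RULES];
    size_t n = 0, pos = 0, len = strlen(line);

    assert(out != NULL);
    if (nrules > MAX_RULES)
        return -1;
    for (size_t i = 0; i < nrules; i++) {
        if (regcomp(&compiled[i], rules[i].pattern, REG_EXTENDED) != 0) {
            while (i > 0)
                regfree(&compiled[--i]);
            return -1;
        }
    }

    while (pos < len && n < max_out) {
        int best = -1;
        regmatch_t best_m = {0};

        for (size_t i = 0; i < nrules; i++) {
            regmatch_t m;
            /* Past the first byte we are no longer at beginning of line. */
            int flags = pos > 0 ? REG_NOTBOL : 0;

            if (regexec(&compiled[i], line + pos, 1, &m, flags) != 0)
                continue;
            if (m.rm_eo == m.rm_so)
                continue;               /* ignore empty matches */
            if (best < 0 || m.rm_so < best_m.rm_so) {
                best = (int)i;          /* earliest start wins, ties keep the first rule */
                best_m = m;
            }
        }
        if (best < 0)
            break;                      /* nothing else matches on this line */

        out[n].start = pos + (size_t)best_m.rm_so;
        out[n].end = pos + (size_t)best_m.rm_eo;
        out[n].scope = rules[best].scope;
        n++;
        pos += (size_t)best_m.rm_eo;    /* continue after the match */
    }

    for (size_t i = 0; i < nrules; i++)
        regfree(&compiled[i]);
    return (int)n;
}
```

Called with rules for comments (`//.*`), keywords (`(int|return)`) and numbers (`[0-9]+`) on the line `int x = 42; // answer`, this yields three spans: a keyword, a number and a comment. A real port would also need the begin/end rule stack that lets TextMate grammars carry state from one line to the next.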
No more than 4854 lines (including comments) in javascript/typescript
Tests excluded, it's 3779 lines of code (source: cloc(1)).
Tests included, it's 5074 lines of code.
Why not Sublime grammar instead of TextMate grammar? It seems more powerful, and easier to read.
I think .sublime-syntax is easier to write and more readable.
Sublime text 3 has implemented a new grammar format that seems much better than the traditional textmate grammar.
Is it because there have been fewer .sublime-syntax files written than .tmLanguage ones? Is there a licensing issue with these files?
@lacygoill maybe the TextMate grammar is a little easier, because there are reference implementations:
But Sublime is closed source, so wouldn't we need to write everything from scratch?
Good point. I forgot that sublime was closed source.
Is TextMate much better (readability, reliability, performance) than our current syntax highlighting mechanism?
Just for TypeScript alone, there have been 754 reported bugs, 41 remaining open currently.
Assuming we support TextMate, what would happen to our current issues related to syntax highlighting? Do we close them, and tell their authors to use the new syntax highlighting mechanism? If the users find issues in TextMate grammar files, do we accept their reports on this bug tracker? IOW, is it going to help reduce the number of remaining open issues here?
Because TypeScript is a new language that evolves quickly?
Oniguruma plus a JSON-like config is certainly faster than Vim's current mechanism. People seldom encounter performance issues in syntax highlighting when using textmate/vscode/eclipse/jetbrains.
Sublime's grammar seems more readable and powerful than TextMate's; maybe Oniguruma plus a config can achieve something similar.
I remember an issue where Vim was very slow when adding/removing text properties on CursorMoved. It only occurred while syntax highlighting was enabled, so one might think that the latter was the culprit. It turns out that the syntax highlighting was fine; the issue was Vim redrawing the screen too much.
With regard to how people perceive the current syntax highlighting as being too slow, I wonder which part of the issue comes from the syntax highlighting itself, and which part from something else (like too much redrawing).
> People seldom encounter such issues in syntax highlighting when using textmate/sublime2/vscode/eclipse/jetbrains.
That's interesting. I hope it's really thanks to their own syntax highlighting mechanism, and not some other optimizations (like multithreading).
A couple of remarks:
after/syntax, would it be possible to do that with TextMate as well?
I think the performance of Vim syntax highlighting could be improved before trying alternatives, for example:
I want to add that we currently have no safeguards for tree-sitter of the kind that are applied to regex-based highlighting, such as limiting the line number, or doing background parsing like Atom would do.
> Background syntax highlighter is still immature
I think background syntax highlighting (if you refer to asynchronous or separate threads highlighting) is neither implemented for tree-sitter nor for traditional vim highlighting. The possibility to make a fast thread-safe copy of the parsing state for tree-sitter or any other kind of multithreading is not used at the moment in Neovim.
Many of the issues you cited complained about features missing due to missing :h syntax. It will always be difficult to transition from one syntax system to another, especially when it is as widely supported as Vim's syntax/fold/indent files. Maybe it would be easier to maintain more compatibility with a system that works more similarly.
About the quality of the grammars, you surely have different trade-offs. VS Code has significantly more users than Atom and nightly Neovim. Tree-sitter parses the whole document, which can help with complex syntax constructs and large-scale structure. However, it gets confused more easily when it sees something that cannot be handled by the language grammar (preproc constructs or non-standard language extensions), while regexes with a more local view are often still OK. The error recovery capabilities vary a lot depending on how the concrete grammar is written. Tree-sitter provides something in between regex highlighting and LSP-like semantic highlighting, so it might not be necessary if the latter two are available for a language. Distributing binaries is another challenge for tree-sitter. Arbitrary code execution through custom scanners enables the highest flexibility, but may also pose a security risk if the parsers are not self-generated and the scanner code is not reviewed.
For those who haven't seen it, this is an excellent introduction to Tree-sitter, by the author: https://www.youtube.com/watch?v=Jes3bD6P0To&ab_channel=StrangeLoopConference
tl;dr: Tree-sitter is a (portable, dependency-free) C library which (conceptually) takes a grammar (expressed in JavaScript) and a source file, and returns a parse tree for the source file with respect to the grammar. The big selling point is that TS (claims that it) can handle syntax errors well (still return a reasonable parse tree) and that it is incremental (returns new parse trees efficiently/quickly given some code edits and previous trees).
Parsers for different languages are provided by the community and while I haven't seen this first-hand, I find it easy to believe that many of them are not great. But the project is much younger than TextMate, and GitHub uses it for its on-web syntax highlighting so there might be some corporate support there.
Personally, the thing I would be most excited about seeing is Vim exposing a representation of the syntax tree which can be used not just for syntax coloring but also for semantic editing (expand visual selection one AST node up, copy function body, etc.). IDK how well the Vim architecture supports this today. But in theory you could then plug in whatever parse-tree-generator you choose (Tree-sitter or TextMate).
If you are using an LSP language server, it's true that the LS can give you a parse tree (one which is even more accurate, especially in the case of context-sensitive grammars like C++), but a language server (which Vim also doesn't natively support yet) will always be slower (it will do more than a parser, for example it will resolve cross-file dependencies and so on) and therefore will have to be async and higher-latency. So I think there is room for both a fast incremental parse system (like Tree-sitter) and LSP support (for things like go-to-definition and find usage).
See also this discussion in the VSCode repo: microsoft/vscode#50140
As someone who has spent months writing and maintaining TextMate and tree-sitter grammars for real-world languages, let me tell you that the TextMate grammar system is totally broken, at least from a 2021 perspective. TextMate grammars are a nightmare to maintain and impossible to get right. Out of desperation, I even developed my own macro system (just like the authors of TypeScript's TextMate grammar), and it was still a nightmare.
tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).
Betting on TextMate grammars in 2021 would be an engineering crime.
I am not sure how much of your hyperbolic speech can be deemed accurate, but from what I can see one of the biggest problems with tree-sitter is the generally low quality of the parsers contributed by different people, as pointed out by the OP. "Top-notch" is not the way I would describe it. This certainly needs to be taken into account, as it would require a vast amount of effort to deal with the issues Vim would inherit by undertaking the HUGE project of integrating tree-sitter.
I can't speak for the TextMate grammar for lack of familiarity. Personally, my biggest problem with tree-sitter (at least the way Neovim does it) is its dependency on the environment (gcc/clang), its large binary size, and the do-it-all mentality, which suits Neovim but definitely does not feel like the "Vim way".
> tree-sitter is in a completely different league. It's a top-notch incremental parser that can be used for accurate (!) syntax highlighting, code folding, code formatting, etc. tree-sitter grammars are dramatically easier to write and maintain, and it's actually possible to get them right. GitHub has been using tree-sitter for a while, and VSCode is also starting to use it (see https://github.com/microsoft/vscode-anycode).
If tree-sitter is top-notch, how come a ubiquitous and highly popular language like Python has been broken in it for quite a while?
When I tested Neovim 0.5.1 with tree-sitter, I ended up having to disable TS for Python (which is the language I use the most) because the indenting and highlighting were unusable. That doesn't exactly inspire confidence.
I think this discussion is devolving more and more from the purely technical into prejudices. It is very important here to distinguish
tree-sitter (the engine, which I would agree with @fcurts is an excellent piece of software and fundamentally superior to other syntax engines);
I think Vim should at this stage focus on 1. to make a reasoned decision (while it of course makes good sense -- and would make me very happy -- to take Neovim's approach and decisions for 2. into account, admitting that the two projects have different needs).
And I find it highly disingenuous to point fingers at 3. while ignoring that the quality of TextMate grammars (and, indeed, Vim's bundled syntax files) varies wildly as well. It's clear that (just like Neovim) you cannot simply switch engines and have to support both (on a per-language basis) for some time until the replacement catches up.
I was obviously talking about the engine, which is what matters in the long run. Regarding existing grammars, the difference is that tree-sitter grammars can be improved relatively easily because they can be reasoned about. On the other hand, improving real-world TextMate grammars is anywhere from difficult to impossible. (Often, fixing one problem causes an inexplicable problem somewhere else, which is only discovered later.)
I can't comment on integration aspects. I'm not even a Vim user. But as a language/tooling developer myself, I feel strongly that it's time to move past TextMate grammars, which is why I offered my insights. Good luck!
> If tree-sitter is top-notch, how come a ubiquitous and highly popular language like Python has been broken in it for quite a while?
> When I tested Neovim 0.5.1 with tree-sitter, I ended up having to disable TS for Python (which is the language I use the most) because the indenting and highlighting were unusable. That doesn't exactly inspire confidence.
@jgb Indentation has nothing to do with tree-sitter itself. There is a very ad hoc implementation that uses the parsed tree as indentexpr. Python indentation is not working because this implementation only considers the syntax node you are currently on, which is nothing in the case of the Python parser, because the relevant syntax node ended on the previous line when you start a new one. One would have to add a rule that respects this case or tune the general logic at this point.
You always have to write some system that translates your parsed representation to indents. The quality of this translation says nothing about the quality of the representation itself.
As someone who recently spent some time writing a TreeSitter grammar, I have also become less enthusiastic about the project. I watched the author’s presentation a while ago and it sounded like the greatest invention since sliced bread, but in practice it doesn’t always work that well.
The biggest obstacle in my opinion is languages with preprocessors (e.g. C and C++). This isn’t something I had considered initially, but it is simply impossible to parse those languages with TreeSitter because you’re dealing with a language within a language. Now before someone mentions this: I know TreeSitter supports injections, e.g. JavaScript in HTML, but that’s not the same thing because, as I understand, each injection is essentially its own “program”. It’s fundamentally not possible to parse pre-processed languages with a context-free grammar. If you think about it, conditional compilation is as context-sensitive as it gets.
I’m talking about constructs like this:
    #if FLAG
    if (foo) {
    #endif
        bar;
    #if FLAG
    }
    #endif
Or this:
    #define BEGIN_FUNC void func(void) {
    #define END_FUNC }

    BEGIN_FUNC
        bla;
    END_FUNC
Or this:
    #define RENAME(x) renamed_ ## x

    void RENAME(my_func)(void) {
        bla;
    }
How is TreeSitter supposed to generate an AST for such code if it doesn’t interpret the macros? It’s simply impossible. And often this will result in parse errors. Now, TreeSitter is in theory “fault tolerant”, so it should be able to recover from errors, but I’ve found that it often recovers in a weird, unpredictable way that causes syntax highlighting to be messed up. It gets even worse when we’re talking about using it for features like syntax-aware selections, indentations and folds: Just forget about it.
All TreeSitter grammars for preprocessed languages contain hacks to work around this issue, but they never work 100%. They just handle a few special cases, but blow up in the general case.
The next problem is that parsing is incredibly slow. I benchmarked parsing a 4 MB file and it took over a second. Depending on where you are coming from, that might not sound too bad, but 4 MB a second really isn’t impressive when you consider that modern RAM can handle tens of gigabytes per second. Quite frankly, I’m not sure this “incremental parsing” approach is all that useful when the implementation is so slow in practice. I guarantee I could write a hand-rolled parser that would just reparse the entire file on every edit and it would still be orders of magnitude faster.
I’ve also found that syntactic highlighting doesn’t actually add that much value over a simple lexer, but it is significantly more complex. Semantic highlighting on the other hand is even more complex, but it also adds a lot of value. If I had to rate the cost-benefit relationship, I’d say: lexer > semantic > syntactic.
If I had to design a syntax highlighting system from scratch, I’d probably just go with a simple C API, something like this:
    typedef enum {TOK_IDENT, TOK_STRING, TOK_OPERATOR, ...} Token;

    void highlight_tokens(const char *buf, size_t len, Token *tokens,
                          const void *input_state, void *output_state,
                          size_t state_size);
You just pass a chunk of data to the parser and then it returns a buffer with a character class for each character (or maybe an array of ranges, see also LSP for a similar approach). This is the most general form, giving you the greatest amount of flexibility. You could hand-roll a parser, or build one based on regexes or TreeSitter grammars or whatever. It doesn’t restrict you to a particular system.
I’d even consider getting rid of the state persistence stuff and just pass one large buffer containing the entire file and reparse the whole file every time. Because in the general case, you have to do it anyway. Consider putting a comment /* at the beginning of a very large file. No matter what you do, sometimes you’ll have to reparse everything, so I’m not sure it is even worth adding complexity to save time for only some edits. Better work on making the parser really fast. Computers are fast, it shouldn’t take that long to parse even a 100 MB file. And source files are usually much smaller than this.
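As an illustration of the API proposed above, here is a minimal sketch of what an implementation with state persistence might look like. Everything beyond the `highlight_tokens` signature is an assumption made for the example: the reduced `Token` enum, the `LexState` struct, and the premise that the editor feeds whole lines so that `/*` and `*/` are never split across two calls. It only classifies block comments, which is enough to show how the state flows from one chunk to the next.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced token set for this sketch (the proposal lists more classes). */
typedef enum { TOK_TEXT, TOK_COMMENT } Token;

/* Hypothetical persistent state: were we inside a block comment
 * when the previous chunk ended? */
typedef struct {
    int in_block_comment;
} LexState;

/* Writes one Token per input byte. The caller passes in the state saved
 * after the previous chunk and receives the state to use for the next. */
void highlight_tokens(const char *buf, size_t len, Token *tokens,
                      const void *input_state, void *output_state,
                      size_t state_size)
{
    LexState st = {0};
    size_t i = 0;

    assert(state_size == sizeof(LexState));
    if (input_state != NULL)
        memcpy(&st, input_state, sizeof st);

    while (i < len) {
        if (st.in_block_comment) {
            tokens[i] = TOK_COMMENT;
            if (buf[i] == '*' && i + 1 < len && buf[i + 1] == '/') {
                tokens[i + 1] = TOK_COMMENT;    /* closing slash */
                st.in_block_comment = 0;
                i += 2;
            } else {
                i++;
            }
        } else if (buf[i] == '/' && i + 1 < len && buf[i + 1] == '*') {
            tokens[i] = TOK_COMMENT;
            tokens[i + 1] = TOK_COMMENT;
            st.in_block_comment = 1;
            i += 2;
        } else {
            tokens[i++] = TOK_TEXT;
        }
    }

    if (output_state != NULL)
        memcpy(output_state, &st, sizeof st);
}
```

An editor could keep one `LexState` per line and, after an edit, re-lex forward only until the output state of a line matches the one it had stored before. Typing `/*` at the top of a file still forces a full re-lex, which is the worst case described above.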
Anyone who eagerly promotes tree-sitter here should answer my questions above first. Repeating its advantages a thousand times does not mean that these fatal problems will disappear.
Tree-sitter is not a new thing, so there is no need to be so excited. Remember that Atom adopted tree-sitter back in 2018, and users in the Atom community are very calm about this "new" feature.
I don't need a better highlighter at the cost of performance and flexibility. I am suffering from performance issues right now, and all I want is fast, static, regex-based highlighting.
@lacygoill you claimed in this comment that the problem was caused by "drawing too much".
That's not true, I have done a bisect investigation into this problem here:
And found that there was a big performance regression after 8.0.643 and 8.0.647. You can simply compare the syntax highlighting speed of Vim 7.4 and the latest Vim 8.2.xxxx, and you will find that this is by no means a simple "drawing too much" problem.
> @lacygoill you claimed in this comment that the problem was caused by "drawing too much".
> That's not true,
It is. The patch that fixed my issue only reduced how often Vim was redrawing the screen:
    diff --git a/src/textprop.c b/src/textprop.c
    index b6cae70a8..e74c13849 100644
    --- a/src/textprop.c
    +++ b/src/textprop.c
    @@ -809,6 +809,7 @@ f_prop_remove(typval_T *argvars, typval_T *rettv)
         int id = -1;
         int type_id = -1;
         int both;
    +    int is_removed = FALSE;
     
         rettv->vval.v_number = 0;
         if (argvars[0].v_type != VAR_DICT || argvars[0].vval.v_dict == NULL)
    @@ -889,6 +890,7 @@ f_prop_remove(typval_T *argvars, typval_T *rettv)
                     if (both ? textprop.tp_id == id && textprop.tp_type == type_id
                              : textprop.tp_id == id || textprop.tp_type == type_id)
                     {
    +                    is_removed = TRUE;
                         if (!(buf->b_ml.ml_flags & ML_LINE_DIRTY))
                         {
                             char_u *newptr = alloc(buf->b_ml.ml_line_len);
    @@ -920,7 +922,8 @@ f_prop_remove(typval_T *argvars, typval_T *rettv)
                     }
                 }
             }
    -    redraw_buf_later(buf, NOT_VALID);
    +    if (is_removed)
    +        redraw_buf_later(buf, NOT_VALID);
     }
As anyone can see, the patch did one thing, and one thing only: it put a condition on redraw_buf_later(); the latter can only be invoked if is_removed is true:
    if (is_removed)
It did nothing else. And yet, it was enough to fix the issue.
> I have done a bisect investigation into this problem here:
> Syntax highlighting is extremely slow when scrolling up in recent version (v8.0.1599) #2712
This has nothing to do with my comment. It's an entirely different issue. The only way your comment might be relevant would be if I had written:
whenever Vim is slow, it's because it redraws the screen too much
But I did not say that. And the comment you link did not say that either.
I wrote that in my issue, the cause was too much redraw.
I did not write that in all issues, the cause was too much redraw.
Two last notes before I unsubscribe from this thread.
Asking for questions or clarifications is OK, but saying that I lie is not. I don't want to read anything from you anymore, so I've blocked you.
I don't care whether Vim integrates tree-sitter, TextMate, or whatever software is trending right now. All I care is how reliable Vim is.
@lacygoill, sad to hear that. I have been following you on GitHub for years, reading your posts in the issues, and studying your early vim9 plugin projects. What I meant was nothing more than "your speculation may be wrong". Saying that I accused you of lying was a bit of an overreaction.
You just blocked a faithful follower.
I made the TextMate parser portable by removing the macOS Foundation code. It may be worth testing as a Vim plugin; making one is beyond my skill set.
https://github.com/icedman/tm-parser
This library works well in my editor projects, including an ncurses-based editor. It also works well enough with my Flutter app, Ashlar Code for Android (munchyapps.com).