Starting syntax highlighter project

Sage Gerard

Feb 18, 2020, 1:27:13 PM
to Racket Users
Hi folks,

I'm starting a syntax highlighter project here: https://github.com/zyrolasting/syntax-highlighting

There seems to be a precedent for using an existing highlighter from another ecosystem. I understand the pragmatism behind that, but a syntax highlighter seems to me a missing battery in Racket.

Without funding I am unable to compete with implementations that support 200+ languages. So I merely intend to provide two renderers (Terminal and HTML [XML syntax]), some palettes, and a sensible extension to parsack that classifies characters using Pygments' token classes. This should provide a strong, familiar core on which to add features according to community interest.
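
A minimal sketch of that split between classification and rendering, written in Python for brevity rather than Racket. Everything here is an assumption for illustration: the `PALETTE` table, the `render_terminal` and `render_html` names, and the hand-written token stream are hypothetical, and the class names only imitate Pygments-style token classes.

```python
from html import escape

# Hypothetical palette: maps a Pygments-style token class name to an
# ANSI SGR color code. The names and colors are illustrative only.
PALETTE = {
    "Keyword": "35",   # magenta
    "String": "32",    # green
    "Comment": "90",   # bright black
}

def render_terminal(tokens, palette=PALETTE):
    """Wrap each classified token in ANSI escape codes, if the palette has a color."""
    out = []
    for token_class, text in tokens:
        code = palette.get(token_class)
        out.append(f"\x1b[{code}m{text}\x1b[0m" if code else text)
    return "".join(out)

def render_html(tokens):
    """Wrap each classified token in a <span> whose class names the token class."""
    out = []
    for token_class, text in tokens:
        body = escape(text)  # escape &, <, > and quotes for well-formed markup
        if token_class == "Text":
            out.append(body)
        else:
            out.append(f'<span class="{token_class.lower()}">{body}</span>')
    return "<pre>" + "".join(out) + "</pre>"

# A hand-classified token stream standing in for parser output.
TOKENS = [
    ("Keyword", "define"),
    ("Text", " "),
    ("Name", "x"),
    ("Text", " "),
    ("String", '"<hi>"'),
]
```

Both renderers consume the same `(token class, text)` pairs, so adding a language would only mean producing that stream; the renderers and palettes stay untouched.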

That being said, I invite feedback and collaboration to add support for languages once the core mechanisms are established.

~slg


Martin DeMello

Feb 19, 2020, 4:11:52 PM
to Sage Gerard, Racket Users
Nice, I'll be following this with interest! What are the pros and cons of developing a new syntax highlighting format from scratch, versus e.g. parsing and reusing the kate style files? For the latter route this haskell package is a good source of inspiration: https://hackage.haskell.org/package/skylighting

martin

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/GC9QAolNyNFBE-KrDXfvA0q0P5_7jBWs7DEger3xGw5sYRmKg0rowpzyLIlyb1hrDOg_2xxTLn74TR5pCDIPhOzqIN5baGHoO7TU4GuaLDI%3D%40sagegerard.com.

Sage Gerard

Feb 19, 2020, 5:19:54 PM
to Martin DeMello, Racket Users
Thank you for the reference, Martin. After looking at skylighting I ended up reading some XML specs in [1] after visiting Kate's.

I can see some potential shortcuts with the XML specs, but I'm seeing a lot of data entry that won't really know how to highlight things like "->" in Racket or "X Y(Z);" in C++. I'm hoping parsack can handle these nuances. Still, the implementation burden is high, and writing parsers means not having a syntax-highlighting data format.

I'm thinking about writing the package such that a pre-installer builds an isolated Docker instance and runs a container for an existing highlighter. The goal is to provide what feels like a pure Racket package with no extra install steps. From there I can add parsack implementations over time. No clue how that will go.


~slg


Sorawee Porncharoenwase

Feb 19, 2020, 11:38:09 PM
to Sage Gerard, Martin DeMello, Racket Users

In my opinion, it would be very cool if the project were based on DrRacket's color-lexer ecosystem. So instead of writing an ad-hoc lexer for, say, Python, we would help improve the existing #lang python's color-lexer.

Pro:

  • The result will look just like what DrRacket displays.
  • Integrates well with #lang. Potentially useful for Scribble, since right now Scribble really only displays Racket code well.
  • Pushes the community to improve or write more color-lexers.

Con:

  • Token styles are pretty limited (only 8 categories), but then again, perhaps this will push DrRacket to support more than 8 categories.

It looks like we can programmatically extract information from color-lexer via module-lexer. See this file for its usage.

And since we use DrRacket's ecosystem, we can go further than just syntax highlighting, e.g., showing a binding arrow interactively when hovering the mouse over an identifier. The code example at this link shows how we can extract such information.


Sage Gerard

Feb 19, 2020, 11:55:16 PM
to sorawe...@gmail.com, martin...@gmail.com, racket...@googlegroups.com
I'm very much in favor of interoperability and am happy to work in that direction. Does this imply that we need a #lang for each highlighting target? What happens if you want to highlight code mixtures? Some snippets of interest to me can include Javascript, Markdown, CSS, HTML and Racket all within 20 lines.



Sorawee Porncharoenwase

Feb 20, 2020, 12:15:44 AM
to Sage Gerard, Martin DeMello, Racket list
On Wed, Feb 19, 2020 at 11:55 PM Sage Gerard <sa...@sagegerard.com> wrote:
> I'm very much in favor of interoperability and am happy to work in that direction. Does this imply that we need a #lang for each highlighting target?

With my approach, yes, but note that technically, the #lang doesn't need to be functional. The whole module could just expand into code that raises a "not implemented" runtime exception. I'm not sure that this is a good idea in practice, though.

> What happens if you want to highlight code mixtures? Some snippets of interest to me can include Javascript, Markdown, CSS, HTML and Racket all within 20 lines.

Then the color lexer would need to be context-sensitive and know when to switch its lexing mode. Note that this problem is not specific to this approach. Any other approach would have the same problem.
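
To make "switching its lexing mode" concrete, here is a rough sketch in Python (not Racket, and not DrRacket's color-lexer API). The `SWITCHES` table and `split_by_mode` function are hypothetical names, and the marker-based rules are a deliberate oversimplification: a real lexer would need to handle attributes, comments, strings containing the markers, and so on.

```python
# Hypothetical switch table: in a given mode, seeing the marker moves
# the lexer to the next mode. Real languages need far richer rules.
SWITCHES = {
    "html": [("<style>", "css"), ("<script>", "javascript")],
    "css": [("</style>", "html")],
    "javascript": [("</script>", "html")],
}

def split_by_mode(source, mode="html"):
    """Return (mode, text) segments, switching modes at known markers."""
    segments = []
    pos = 0
    while pos < len(source):
        # Find the earliest marker that is valid in the current mode.
        hits = [(source.find(marker, pos), marker, target)
                for marker, target in SWITCHES[mode]]
        hits = [h for h in hits if h[0] != -1]
        if not hits:
            # No further switch: the rest belongs to the current mode.
            segments.append((mode, source[pos:]))
            break
        index, marker, target = min(hits)
        if target == "html":
            # A closing marker is left for the host language to lex.
            cut = index
        else:
            # An opening marker is consumed by the host language.
            cut = index + len(marker)
        if cut > pos:
            segments.append((mode, source[pos:cut]))
        pos = cut
        mode = target
    return segments
```

Each segment could then be handed to the lexer for its language, which is the same bookkeeping any approach to mixed snippets would have to do somewhere.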

It would be cool if there's a way to annotate code with `#reader` at the meta level, which would make this problem much easier...

Philip McGrath

Feb 20, 2020, 12:51:18 AM
to Sorawee Porncharoenwase, Sage Gerard, Martin DeMello, Racket list
You don't need a `#lang` to use `color:text<%>`: I've used it to do basic syntax highlighting for XML. In fact, you don't even need a GUI for the relevant part of the protocol, which is what `#lang`s implement. The requirements are described in the documentation for the `get-token` argument to the `start-colorer` method. There are specific requirements on how the function must behave to support efficient interactive re-tokenization: these are overkill for an ahead-of-time syntax highlighter, but if your lexers can meet those requirements, they should be usable for a wide range of tasks.

I have a theory that you could use delimited continuations to help with some of the bookkeeping, with the continuation becoming (part of?) the "mode" value passed between calls to the `get-token` function.
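
As a rough illustration of the shape of that protocol, here is a simplified Python analog (not the Racket API). It assumes a function that lexes one token at a given offset and returns the token together with a new "mode" that the caller passes back on the next call. The names `get_token` and `tokenize` are hypothetical stand-ins, the mode is a plain string rather than a continuation, and the sketch leaves out other details of the documented protocol.

```python
def get_token(source, offset, mode):
    """Lex one token at `offset`; return (lexeme, kind, start, end, new_mode)."""
    ch = source[offset]
    if mode == "text":
        if ch == "<":
            return ch, "punct", offset, offset + 1, "tag"
        end = source.find("<", offset)
        end = len(source) if end == -1 else end
        return source[offset:end], "text", offset, end, "text"
    # Otherwise we are inside a tag.
    if ch == ">":
        return ch, "punct", offset, offset + 1, "text"
    end = offset
    if ch.isspace():
        while end < len(source) and source[end].isspace():
            end += 1
        return source[offset:end], "whitespace", offset, end, "tag"
    while end < len(source) and not source[end].isspace() and source[end] != ">":
        end += 1
    return source[offset:end], "name", offset, end, "tag"

def tokenize(source, offset=0, mode="text"):
    """Drive get_token, threading the mode from one call to the next."""
    tokens = []
    while offset < len(source):
        lexeme, kind, start, end, mode = get_token(source, offset, mode)
        # Record the mode in effect after this token so lexing can resume here.
        tokens.append((lexeme, kind, start, end, mode))
        offset = end
    return tokens
```

Because every token records the mode that follows it, lexing can restart at any token boundary from the saved offset and mode instead of from the beginning, which is the property that makes interactive re-tokenization cheap.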

-Philip




