A tree-sitter grammar for Shen (plus a reader question about regex.shen)

Luiz de Milon

May 30, 2026, 7:36:24 PM
to Shen

Hello,

I used Claude to help me write a tree-sitter grammar for Shen, and I'd like to share it with the community: https://github.com/luizdemilon/tree-sitter-shen

To be clear about provenance: the grammar, queries, tests, and docs were written by Claude under my direction — I scoped it, made the design decisions, and validated the result against the official sources, but I didn't hand-write the parser. I'm sharing it because I believe it will be useful to more people.

What it is: tree-sitter gives editors fast, incremental, structural parsing, so this provides syntax highlighting and structural navigation for Shen in Neovim, Emacs (29+ treesit), and Zed. It traces to the Official Shen Manual §12 construct by construct (a GRAMMAR.md maps each BNF production to a grammar rule), and I validated it by parsing the whole of shen-sources.

That validation is where I have a question for people who know the reader. It parses every file cleanly except for one line of valid Shen, in lib/stlib/Strings/regex.shen (master, 93ed67e):

228: [| |RS] -> (re-or RS)

229: [bar! |RS] -> (re-or RS)

Line 228 uses a bare | as a literal list element (the regex "or"); line 229 uses the escaped bar! 

Tracing sources/reader.shen ("<bar> <s-exprs> := [bar! | <s-exprs>]", then cons-form), both [| |RS] and [bar! |RS] seem to read to the same thing — (cons bar! RS). If that's right, the two clauses are identical patterns and line 228 is redundant.
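
For readers who want to poke at the reasoning above, here is a minimal JavaScript model of it. It is a sketch, not the real sources/reader.shen: it assumes that every bare | token is read as the symbol bar!, and that cons-form treats a bar! sitting between exactly two remaining elements as the separator. The names tokenizeList, readElements, consForm, and readList are made up for this illustration, and nested lists are not handled.

```javascript
// Simplified model of the list reader; NOT the real sources/reader.shen.
// Assumption: a bare `|` token is read as the symbol bar!.

function tokenizeList(src) {
  // Brackets and `|` are single-character tokens; anything else is an atom.
  return src.match(/[\[\]|]|[^\s\[\]|]+/g) || [];
}

function readElements(src) {
  const tokens = tokenizeList(src);
  if (tokens[0] !== "[" || tokens[tokens.length - 1] !== "]") {
    throw new Error("expected a flat bracketed list");
  }
  // Every bare `|` becomes bar!, so `|` and `bar!` are indistinguishable here.
  return tokens.slice(1, -1).map(t => (t === "|" ? "bar!" : t));
}

function consForm(items) {
  if (items.length === 0) return "[]";
  // Assumption: [X bar! Y] is the only shape in which bar! acts as a separator.
  if (items.length === 3 && items[1] === "bar!") {
    return `(cons ${items[0]} ${items[2]})`;
  }
  return `(cons ${items[0]} ${consForm(items.slice(1))})`;
}

function readList(src) {
  return consForm(readElements(src));
}

console.log(readList("[| |RS]"));
console.log(readList("[bar! |RS]"));
console.log(readList("[a b | c]"));
```

Under these assumptions both spellings from regex.shen come out as (cons bar! RS), which is the equivalence the question asks about. Whether the real reader has a subtlety this model misses is exactly what the question is for.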

Two questions:

1. Am I reading that correctly — do [| |RS] and [bar! |RS] produce the same pattern, or is there a reader subtlety that makes them distinct?

2. This is the only place in all of shen-sources where a literal bar is written as | rather than bar!. Since the bare form is ambiguous for any tool that reads | as the cons separator, would it be reasonable to standardize on bar! here? It'd be a one-line change with (as far as Claude can tell) no behavioral effect.

This isn't exactly a bug — regex.shen loads and works in Shen; it's a question about the reader and about tidying the spelling for tooling. (Validation also turned up one genuinely truncated file, tests/lisp.shen; I reported it as https://github.com/Shen-Language/shen-sources/issues/113 and it's already been fixed upstream — thanks, tiz0c!)

Feedback very welcome — on the grammar, the design trade-offs, or anything I've gotten wrong about Shen. And if it's useful to the community, I'd be glad to see it live under the Shen-Language Github org.

Thanks,

Luiz

Luiz de Milon

May 31, 2026, 1:04:10 AM
to Shen
In case it interests anyone, I also just published a Zed extension to highlight Shen: https://github.com/luizdemilon/zed-shen

I'll get this published into the Zed registry shortly.

Mark

May 31, 2026, 1:10:20 AM
to Shen
Cool and a good idea. I wrote my own tree-sitter-shen grammar I can post here. It extends the vanilla Shen grammar with new features behind a few compilation flags. I'm now rewriting Scryer Shen in Common Lisp and I've introduced a functor construction in Shen to mirror ISO Prolog's functors.

Perhaps we can compare notes. I had to expand several rules significantly in my specification to get the more ambiguous parts of the Shen grammar to parse. I also attached the generated parser to a Common Lisp REPL using the cl-tree-sitter library and wrote a pretty printer for the parsed Shen grammar using the Common Lisp Pretty Printing System, for round trip debugging.

Google Groups won't let me post the grammar.js file directly, so here are its contents; sorry for the wall of text:

/**
 * @file tree-sitter parser for the Shen programming language.
 * @author Mark Thom <markjor...@gmail.com>
 * @license MIT
 */

/// <reference types="tree-sitter-cli/dsl" />
// @ts-check

// Toggle this to enable/disable functor support
const functor_ext = true;

const functor = $ =>
      seq(
          '(',
          field("functor", $.functor_symbol),
          repeat(field("argument", $.item)),
          ')'
      );

const functor_pattern = $ =>
      seq(
          '(',
          field("functor", $.functor_symbol),
          repeat(field("argument", $.pattern)),
          ')'
      );

// Base item rule (no functor syntax)
const base_item = $ => choice(
    $.base_pattern,
    seq('[', field("list", repeat1($.item)), ']'),
    prec(1, seq('[', field("head", repeat1($.item)), '|', field("tail", $.item), ']')),
    $.abstraction,
    $.application,
);

// Conditionally extended item rule
const extended_item = $ => choice(
    functor($),
    base_item($)
);

const keyword = word => token(prec(1, word));

module.exports = grammar({
    name: "shen",

    extras: $ => [/\s/],

    rules: {
        source_file: $ => repeat($.definition),

        datatype_kw: $ => keyword('datatype'),
        defmacro_kw: $ => keyword('defmacro'),
        defprolog_kw: $ => keyword('defprolog'),
        define_kw: $ => keyword('define'),
        colon: $ => keyword(':'),
        semicolon: $ => keyword(';'),

        if_kw: $ => keyword('if'),
        let_kw: $ => keyword('let'),
        let_bang_kw: $ => keyword('let!'),
        lambda_kw: $ => keyword(choice('/.', 'lambda')),

        type_open_kw: $ => token(prec(2, '{')),
        type_close_kw: $ => token(prec(2, '}')),

        arrow: $ => token(prec(2, choice('->', '<-'))),
        left_double_arrow: $ => token(prec(2, '-->')),
        right_double_arrow: $ => token(prec(2, '<--')),

        where_keyword: $ => token(prec(2, 'where')),

        alpha: _ => /[a-zA-Z\.=\-*/+_?$!@~><&%\'#`;:{}]/,
        digit: _ => /[0-9]/,
        lowercase_alpha: _ => /[a-z=\-*/+_?$!@~><&%\'#`;:{}]/,

        signs: _ => token(repeat1(choice('+','-'))),
        integer: _ => token(/[0-9]+/),
        float: _ => token(choice(
            seq(/[0-9]+/, '.', /[0-9]+/),
            seq('.', /[0-9]+/)
        )),

        number: _ => token(prec(2,
                                /[-+]?(?:\d*\.\d+|\d+)(?:[eE][-+]?\d+)?/
                               )),

        underline: $ => token(prec(2, repeat1('_'))),
        double_underline: $ => token(prec(2, repeat1('='))),

        functor_symbol: $ => token(
            prec(2, /@[a-z=\-*/+?$!@~><&%\'#`:;{}][a-zA-Z0-9\.=\-*/+_?$!@~><&%\'#`:{}]*/),
        ),

        symbol_literal: $ => choice(
            token(prec(1, /[a-z=\-*/+?$!@~><&%\'#`:;][a-zA-Z0-9\.=\-*/+_?$!@~><&%\'#`:]*/)),
            keyword('{'),
            keyword('}'),
        ),

        variable_literal: $ => choice(
            token(prec(1, /[A-Z][a-zA-Z0-9\.=\-*/+_?$!@~><&%\'#`:]*/)),
        ),

        lowercase_literal: $ => choice(
            token(prec(1, /[a-z][a-zA-Z0-9\.=\-*/+_?$!@~><&%\'#`:]*/)),
        ),

        placeholder: $ => token(prec(2, '_')),

        pattern: $ => choice(
            $.placeholder,
            $.base_pattern,
            seq('[', repeat1(field("head", $.pattern)), optional(seq('|', field("tail", $.pattern))), ']'),
            seq('(', 'cons', field("car", $.pattern), field("cdr", $.pattern), ')'),
            functor_ext ? functor_pattern($) :
                seq('(', choice('@p', '@s', '@v'),
                    field("first", $.pattern), repeat1(field("rest", $.pattern)), ')'),
        ),

        boolean_literal: $ => token(prec(2, choice('true', 'false'))),
        string_literal: $ => token(prec(1, /"([^"\\]|\\["\\/bfnrt])*"/)),

        abstraction: $ => seq(
            '(',
            $.lambda_kw,
            field("parameters", repeat1($.variable_literal)),
            field("body", $.item),
            ')',
        ),

        application: $ => seq(
            '(',
            field("items", repeat1($.item)),
            ')',
        ),

        rule: $ => seq(
            repeat(field("patterns", $.pattern)),
            $.arrow,
            field("result", $.item),
            optional(seq($.where_keyword, field("where", $.item)))
        ),

        item: $ => functor_ext ? extended_item($) : base_item($),

        base_pattern: $ => choice(
            field("boolean", $.boolean_literal),
            field("symbol", $.symbol_literal),
            field("variable", $.variable_literal),
            field("string", $.string_literal),
            field("number", $.number),
            field("empty", seq('(',')')),
            field("nil", seq('[',']')),
        ),

        definition: $ => choice(
            $.datatype_definition,
            $.prolog_definition,
            $.shen_def,
            $.application,
        ),

        shen_def: $ => seq(
            '(',
            field("keyword", $.define_kw),
            field("name", $.lowercase_literal),
            optional(field("type", seq(
                $.type_open_kw,
                field("type_expr", $.type),
                $.type_close_kw
            ))),
            repeat1(field("rule", $.rule)),
            ')',
        ),

        datatype_definition: $ => seq(
            '(',
            field("keyword", $.datatype_kw),
            field("name", $.lowercase_literal),
            repeat1(field("rules", $.datatype_rule)),
            ')',
        ),

        side_condition: $ => choice(
            seq($.if_kw, field("condition", $.item)),
            seq($.let_kw, field("binding", $.prolog_pattern), field("value", $.item)),
            seq($.let_bang_kw, field("binding", $.prolog_pattern), field("value", $.item)),
        ),

        scheme: $ => prec.left(1, seq(
            field("context", $.formula),
            optional(
                seq(
                    field("context", repeat(seq(keyword(','), $.formula))),
                    keyword('>>'),
                    field("conclusion", $.formula),
                ),
            ),
        )),

        simple_scheme: $ => prec.left(2, seq(
            field("formula", $.formula),
            $.semicolon,
        )),

        formula: $ => choice(
            prec(1, seq(field("term", $.item), $.colon, field("type", $.item))),
            $.item
        ),

        type: $ => choice(
            prec(1, seq($.left_double_arrow, $.type)),
            $.inner_type,
        ),

        inner_type: $ => choice(
            $.base_pattern,
            $.application,
            seq('[', field("head", $.pattern), '|', field("tail", $.pattern), ']'),
            seq('[', repeat1(field("element", $.pattern)), ']'),
            prec.right(2, seq($.type, $.left_double_arrow, $.type)), // A --> B
        ),

        datatype_rule: $ => seq(
            field("conditions", repeat($.side_condition)),
            field("pre_premises", repeat($.simple_scheme)),
            choice(
                seq(
                    $.double_underline,
                    field("conclusion", $.formula),
                    $.semicolon
                ),
                seq(
                    $.underline,
                    field("conclusion", $.scheme),
                    $.semicolon,
                ),
                seq(
                    field("premises", repeat(seq($.scheme, $.semicolon))),
                    $.underline,
                    field("conclusion", $.scheme),
                    $.semicolon,
                ),
            )
        ),

/* // this is a more natural datatype_rule grammar but it's too
   // ambiguous for tree-sitter.
        datatype_rule: $ => choice(
            seq(
                field("conditions", repeat($.side_condition)),
                field("premises", repeat(seq($.scheme, $.semicolon))),
                $.underline,
                field("conclusion", $.scheme),
                $.semicolon
            ),
            seq(
                field("conditions", repeat($.side_condition)),
                field("premises", repeat1($.simple_scheme)),
                $.double_underline,
                field("conclusion", $.formula),
                $.semicolon
            )
        ),
*/

        prolog_definition: $ => seq(
            '(',
            $.defprolog_kw,
            field("name", $.lowercase_literal),
            field("clauses", repeat1($.clause)),
            ')'
        ),

        prolog_pattern: $ => choice(
            $.placeholder,
            $.base_pattern,
            seq('[', field("head", repeat1($.prolog_pattern)), '|', field("tail", $.prolog_pattern), ']'),
            field("list", seq('[', repeat1($.prolog_pattern), ']')),
            seq('(', 'cons', field("car", $.prolog_pattern), field("cdr", $.prolog_pattern), ')'),
            ...(functor_ext ? [functor_pattern($)] : []),
        ),

        clause: $ => prec.left(1, seq(
            field("head", repeat($.prolog_pattern)), $.right_double_arrow, optional(field("tail", $.tail)),
            $.semicolon,
        )),

        tail: $ => choice(
            seq(field("cut", keyword('!')), optional(field("rest", $.tail))),
            seq(field("goal", $.application), optional(field("rest", $.tail))),
        ),
    }
});
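
A quick way to get a feel for the token classes in the grammar above, without running tree-sitter, is to try its regexes on a few atoms in plain JavaScript. This is only a rough sketch: the patterns are copied from the boolean_literal, number, variable_literal, and symbol_literal rules and anchored for whole-string matching, the fixed checking order is an assumption standing in for tree-sitter's context-aware lexer, and classifyToken is a hypothetical helper that is not part of the grammar.

```javascript
// Token patterns copied from the grammar above, anchored with ^ and $.
// Assumption: trying them in this fixed order only approximates what
// tree-sitter's lexer would do in a real parse.
const TOKEN_PATTERNS = [
  ["boolean", /^(?:true|false)$/],
  ["number", /^[-+]?(?:\d*\.\d+|\d+)(?:[eE][-+]?\d+)?$/],
  ["variable", /^[A-Z][a-zA-Z0-9\.=\-*/+_?$!@~><&%\'#`:]*$/],
  ["symbol", /^[a-z=\-*/+?$!@~><&%\'#`:;][a-zA-Z0-9\.=\-*/+_?$!@~><&%\'#`:]*$/],
];

// classifyToken is a hypothetical helper, not part of the grammar.
function classifyToken(text) {
  for (const [kind, pattern] of TOKEN_PATTERNS) {
    if (pattern.test(text)) return kind;
  }
  return "unmatched";
}

for (const text of ["RS", "bar!", "re-or", "-1.5e3", "true", "|"]) {
  console.log(`${JSON.stringify(text)} -> ${classifyToken(text)}`);
}
```

Note that a bare | matches none of these classes, so in this grammar it can only appear as the literal '|' separator inside a list rule, which is relevant to the regex.shen question earlier in the thread.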

Luiz de Milon

May 31, 2026, 6:47:52 PM
to Shen
Hey Mark,

I'm happy you reached out! This was my first attempt at using Claude to prototype a series of tools that bring the Shen development experience closer to what I'm used to. I saw that syntax highlighting support was uneven across editors, so I had Claude survey the tools that currently exist. I then decided to have it build a tree-sitter grammar so I could read Shen code more comfortably, with syntax highlighting and the like.

There's more to come on that front: since yesterday, building on this tree-sitter code, I have also prototyped a shenfmt tool to reformat Shen code. As we know, Shen doesn't have a single canonical style the way Go does, so I made it both a survey tool and a configurable formatter. I set the presets to match the most common styles the survey found across the sources I parsed (the Shen kernel and shen-sources). It's available at https://github.com/luizdemilon/shenfmt.

Regarding your grammar.js file: since I don't yet have the technical prowess to review it myself, I directed a comparison via Claude. Here's the report (also written by Claude; all the other text in this email was hand-written by me :D):

> - Ran your grammar.js through the same corpus harness (pinned shen-sources, tree-sitter
>   0.26.8): your grammar parses 55/138 files cleanly, this repo's grammar 138/138.
> - The cases we'd expected to break — separators used as data, e.g. [<-- | B], [a --> b],
>   [{ }], := inside a list — actually parse fine; tree-sitter's lexer is state-aware, so
>   those operator tokens don't fire in data position. (We assumed otherwise; running it
>   corrected the assumption.)
>
> - The divergence comes down to two non-fundamental things: no comment rule (every
>   kernel/stlib file opens with \\ ...), and definitions recognised only at the top level —
>   but every kernel file is one (package ...), so nested defines/datatypes collapse to
>   plain applications, and their _ patterns and ___ sequent lines become invalid (e.g.
>   lists.shen errors at the _ in `_ [] -> []`; maths.shen at a datatype's ___). Recursing
>   the package body through the top level clears it.
>
> - Not a knock on yours — it targets your extended dialect and you validate by round-trip,
>   not by chewing the vanilla corpus. They look complementary: yours has the structured
>   sequents/clauses/functors; this one has package recursion, comments, and a corpus
>   regression harness.
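
The two gaps named in the report (no comment rule, and definitions recognised only at the top level) can be illustrated with a toy reader in plain JavaScript. This is a sketch under stated assumptions, not tree-sitter and not either grammar: tokenizeForms, parseForms, and defineNames are hypothetical names, the sample file is made up, and the only facts taken from the report are that files open with \\ comments and wrap their definitions in one (package ...) form.

```javascript
// Toy s-expression reader, not tree-sitter. All names here are hypothetical.
function tokenizeForms(src) {
  // Drop `\\ ...` line comments first. Simplification: a `\\` inside a string
  // literal would also be stripped, which a real lexer must not do.
  const withoutComments = src.replace(/\\\\[^\n]*/g, "");
  return withoutComments.match(/[()\[\]]|"(?:[^"\\]|\\.)*"|[^\s()\[\]"]+/g) || [];
}

function parseForms(tokens) {
  // Round and square brackets both become nested arrays; atoms stay strings.
  const stack = [[]];
  for (const token of tokens) {
    if (token === "(" || token === "[") {
      stack.push([]);
    } else if (token === ")" || token === "]") {
      if (stack.length < 2) throw new Error("unbalanced closing bracket");
      const finished = stack.pop();
      stack[stack.length - 1].push(finished);
    } else {
      stack[stack.length - 1].push(token);
    }
  }
  if (stack.length !== 1) throw new Error("unbalanced opening bracket");
  return stack[0];
}

function defineNames(forms, recurseIntoPackages) {
  const names = [];
  for (const form of forms) {
    if (!Array.isArray(form)) continue;
    if (form[0] === "define") {
      names.push(form[1]);
    } else if (recurseIntoPackages && form[0] === "package") {
      // (package name exports body...): the body starts at index 3.
      names.push(...defineNames(form.slice(3), recurseIntoPackages));
    }
  }
  return names;
}

const sample = [
  "\\\\ a made-up file shaped like the kernel files described above",
  "(package demo []",
  "  (define first-or-nil",
  "    [] -> []",
  "    [X | _] -> X)",
  "  (define id X -> X))",
].join("\n");

const forms = parseForms(tokenizeForms(sample));
console.log(defineNames(forms, false)); // top level only
console.log(defineNames(forms, true));  // recursing into the package body
```

With top-level-only recognition the toy finds no definitions at all, because the only top-level form is the package; recursing into the package body finds both. In an actual tree-sitter grammar the corresponding changes would presumably be a comment token listed in extras and a package rule whose body reuses the top-level definition rule.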

I also ran a new /deep-research to figure out how much work it would take to bring Shen's error messages up to Rust level. The results were:

> - Short version: very doable, and less than I'd feared — the Scryer-hosted checker
>   already builds most of what's needed. Rust-grade diagnostics split across the three
>   compile-time surfaces (reader/parser, the sequent type checker, the Prolog/datatype
>   rules), and the checker already constructs a full proof tree that today just gets
>   dropped on failure.
>
> - The one genuinely hard part is recovering where/why a check fails, since Prolog discards
>   the proof when it backtracks. But the attributed-variable hooks (verify_attributes) fire
>   at the failing unification, before backtracking unwinds — exactly the place to record it.
>   A "keep-deepest failed goal" recorder on top of that needs no CPS rewrite of the prover.
>
> - It's prototyped end-to-end against the actual scryer-shen (Racket+Scryer): on
>   (apply + [1 2 3]) the patched checker recovers the real culprit — type_check([3], (h-list []))
>   i.e. "the third argument is extra; + takes two" — instead of a bare "type error".
>
> - Since it all lives on the Prolog side, it should carry over to your CL rewrite as-is; a
>   tree-sitter front end then supplies the source spans to render it Rust-style (snippet +
>   caret + message, and eventually stable error codes / --explain).
>
> - The rest is staged, well-scoped steps rather than a rewrite: surface the proof tree, pass
>   a structured diagnostic across the boundary, the recorder above, then spans + a renderer,
>   then JSON for editor/LSP.
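
To make the "snippet + caret + message" rendering step concrete, here is a small JavaScript sketch of what such a renderer might look like. Everything in it is an assumption for illustration: renderDiagnostic and the span shape (zero-based line and columns, exclusive end) are hypothetical, the span is written by hand rather than produced by a parser, and the output layout only loosely imitates Rust's. The message text is the one quoted in the report.

```javascript
// renderDiagnostic and the span shape are hypothetical, for illustration only.
// Assumption: the parser front end reports zero-based line and column numbers,
// with endColumn exclusive.
function renderDiagnostic(source, span, message) {
  const lines = source.split("\n");
  const text = lines[span.line] ?? "";
  const gutter = String(span.line + 1); // line numbers are shown one-based
  const pad = " ".repeat(gutter.length);
  const width = Math.max(1, span.endColumn - span.startColumn);
  return [
    `error: ${message}`,
    `${pad} |`,
    `${gutter} | ${text}`,
    `${pad} | ${" ".repeat(span.startColumn)}${"^".repeat(width)}`,
  ].join("\n");
}

const source = "(apply + [1 2 3])";
// Span of the literal 3, written by hand here.
const span = { line: 0, startColumn: 14, endColumn: 15 };
console.log(renderDiagnostic(source, span, "the third argument is extra; + takes two"));
```

The renderer itself is the easy part; the report's point is that the checker must first say which goal failed, and a parser must map that goal back to a source span.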

Here's the full report: https://github.com/luizdemilon/tree-sitter-shen/blob/compare/thom-grammar/THOM-GRAMMAR-COMPARISON.md

Anyway, this is exactly where you'd know far more than the model or I do. If you're up for it, I'll start a dedicated thread, and the prototype and full writeup are yours to look at whenever.

Luiz

dr.mt...@gmail.com

Jun 1, 2026, 1:55:33 AM
to Shen

> 1. Am I reading that correctly: do [| |RS] and [bar! |RS] produce the same pattern, or is there a reader subtlety that makes them distinct?
>
> 2. This is the only place in all of shen-sources where a literal bar is written as | rather than bar!. Since the bare form is ambiguous for any tool that reads | as the cons separator, would it be reasonable to standardize on bar! here? It'd be a one-line change with (as far as Claude can tell) no behavioral effect.

They do produce the same pattern. The bar! was introduced to make it easier for
the Shen compiler to handle | as a standard symbol. | (borrowed from Prolog
syntax) was not really intended to be used for anything but consing. [X | Y] is
simply syntactic sugar for (cons X Y). I think bar! should be made internal to
the Shen package, but more comprehensively I'd now write the compiler to
eliminate | without bar!. In the revised syntax scheme | would not be treated as
a regular symbol, avoiding the unfortunate situation where (intern "|") <> |.

Mark 

Mark

Jun 4, 2026, 4:42:46 PM
to Shen
Hi Luiz,

My grammar is directly based on the Shen syntax EBNF here: https://shenlanguage.org/OSM/Syntax.html

It's true that it doesn't have rules for the defpackage or shen-yacc forms, as Claude noted. I always supposed these forms would be bootstrapped by the macro system, but if you intend to use the tree-sitter grammar for editor highlighting, yes, it should have them. My grammar is longer and more elaborate, which makes it easier to compile the normalized AST down to Common Lisp and to fine-tune shen-mode in Emacs.

In fact, the AST nodes are objects of CLOS classes, which I believe can be used as a basis for syntax classes analogous to Racket's syntax classes in syntax-parse, but much simplified. From there one could create a hygienic macro system for Shen using shen-yacc for parsing, where new syntax classes are again defined as CLOS classes and destructured using the trivia pattern matching library, as I'm already doing. This would be a much safer and more powerful (and even typed!) macro system for Shen. But it breaks backward compatibility with Shen's reader macro system, which simply transforms trees of cons cells into trees of cons cells. These macros would instead deal in the much richer, well-typed structure of the ASTs produced by tree-sitter + Scryer Shen's normalizer.

Mark

Jun 4, 2026, 4:47:39 PM
to Shen
As to the topic of human readable type errors for Shen, a large and ambitious chunk of the scryer-shen project is to produce a visual debugger rendering proofs (whether successful or failed) as trees in the syntax of Gentzen's sequent calculus. That is, the type checker would spit out these trees in a text format, according to a grammar that's renderable as a proof tree (using ImageMagick? some LaTeX package? CLOG? I don't quite know yet) and that allows the programmer to focus on particular portions of the tree, and to examine those portions in greater or lesser degrees of detail. I suppose similarly, a Prolog program could be written to parse that representation and produce a narrative in natural language explaining why a proof failed at a particular point, for instance. It would still require the programmer to have some idea of how proofs are successfully conducted in that system, but the visual debugger would also serve as a didactic tool in this sense, by depicting proof search as a stepwise process involving unifications over constraints.
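
As a very rough illustration of the idea above (a proof emitted as text, with the option to fold away portions of the tree), here is a JavaScript sketch. It is not the scryer-shen format, which the post says is still undecided: the node shape, the renderProof function, and the sample proof are all invented for the example, and the output is a plain indented outline rather than a Gentzen-style layout.

```javascript
// Hypothetical proof-node shape: { goal: string, ok: boolean, premises: [...] }.
// This prints an indented outline, not a Gentzen-style layout.
function renderProof(node, maxDepth = Infinity, depth = 0) {
  const mark = node.ok ? "ok  " : "FAIL";
  const line = `${"  ".repeat(depth)}${mark} ${node.goal}`;
  if (depth >= maxDepth && node.premises.length > 0) {
    return [`${line} ...`]; // subtree folded away at this level of detail
  }
  const below = node.premises.flatMap(p => renderProof(p, maxDepth, depth + 1));
  return [line, ...below];
}

// Toy data, invented for the example.
const failedProof = {
  goal: "[1 a] : (list number)",
  ok: false,
  premises: [
    { goal: "1 : number", ok: true, premises: [] },
    {
      goal: "[a] : (list number)",
      ok: false,
      premises: [
        { goal: "a : number", ok: false, premises: [] },
        { goal: "[] : (list number)", ok: true, premises: [] },
      ],
    },
  ],
};

console.log(renderProof(failedProof).join("\n"));    // full detail
console.log(renderProof(failedProof, 1).join("\n")); // folded below depth 1
```

The maxDepth parameter stands in for the "greater or lesser degrees of detail" mentioned above; a real tool would more likely let the programmer pick a subtree interactively.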

nha...@gmail.com

Jun 6, 2026, 2:40:46 PM
to Shen
Are any of the local AI models strong enough to view a Shen spy trace and tell you what line of code in the function is causing trouble? ChatGPT was able to do it reasonably well about a year ago.