Is it me, or does Treetop create a node for each character?


Yves Dufournaud

Jun 20, 2025, 4:59:36 AM
to Treetop Development
Hello,

Maybe it's obvious, maybe I am utterly wrong, but...

I am playing with Treetop and the very simple grammar below.
What strikes me is that I would have assumed there would be some form of reduction at some point, but with my grammar it looks like every character is being turned into a syntax node.

Looking at the Newline node, I assumed it would be a terminal matching "\n\n".
It clearly is not: its children are one node per piece of the regexp-like expression:

    Newline offset=0, "\n\n":
      SyntaxNode+NEWLINE0 offset=0, "\n":
        SyntaxNode offset=0, ""
        SyntaxNode offset=0, "\n"
      SyntaxNode+NEWLINE0 offset=1, "\n":
        SyntaxNode offset=1, ""
        SyntaxNode offset=1, "\n"

Is this the expected behaviour?
Wouldn't it be a problem if the parsed source is of significant size?
My initial assumption was that every regexp-like expression would end up being turned into a terminal symbol holding a string (not a set of nodes).
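
To put a number on that concern, here is a minimal sketch that counts the nodes in a tree. It assumes only that each node answers `elements` with an array of children, or `nil` on a leaf. `Node`, `count_nodes` and `newline_entry` are hypothetical names used for illustration; the `Struct` stands in for real syntax nodes so the sketch runs without Treetop.

```ruby
# Stand-in for a parse-tree node (hypothetical; a real tree would hold
# Treetop syntax nodes instead of this Struct).
Node = Struct.new(:text, :elements)

# Hypothetical helper: counts a node plus all of its descendants.
def count_nodes(node)
  children = node.elements || []
  1 + children.sum { |child| count_nodes(child) }
end

# One entry of the repetition: an empty whitespace node and a "\n" node,
# mirroring the SyntaxNode+NEWLINE0 entries shown above.
def newline_entry(text)
  Node.new(text, [Node.new("", nil), Node.new("\n", nil)])
end

# Mirror of the Newline subtree shown above for the input "\n\n".
newline = Node.new("\n\n", [newline_entry("\n"), newline_entry("\n")])

puts count_nodes(newline) # 7 nodes for 2 characters of input
```

On a real parse result the same walk should work as long as leaf nodes return `nil` or an empty array from `elements`.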


My very simple grammar:
# python3_parser.treetop
grammar Python3

  rule program
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    ( NEWLINE / stmt)*
  end

  rule stmt
    indentation text:((!"\n" .)*) "\n" <LineNode>
    {
      def inspect(indent="")
        indent + self.class.to_s + " " + text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end

  rule NEWLINE
    ([\t ]* "\n")+ <Newline>
    {
      def inspect(indent="")
        indent + self.class.to_s + " #{elements.size}"
      end
    }
  end

end

Parser code:

# In file parser.rb
require 'treetop'

class LineNode < Treetop::Runtime::SyntaxNode
  def value
    text_value.strip
  end
end

class Newline < Treetop::Runtime::SyntaxNode
end


class Parser
  # Load the Treetop grammar from the 'python3_parser.treetop' file -> produce <grammar name>Parser
  # and then create a new instance of that parser as a class variable so we don't have to re-create
  # it every time we need to parse a string
  Treetop.load('python3_parser.treetop')
  @@parser = Python3Parser.new

  input = <<~TEXT


    block
      line1

      line2
        nested
      line3
  TEXT

  tree = @@parser.parse(input)

  if tree
    puts "Parsed successfully!"
    p tree
  else
    puts "Parsing failed at: #{@@parser.failure_line} #{@@parser.index}: #{@@parser.failure_reason}"
  end
end


The tree I get as a result:
Parsed successfully!
SyntaxNode+Program0 offset=0, "...\n    nested\n  line3\n":
  SyntaxNode offset=0, ""
  SyntaxNode offset=0, "...\n    nested\n  line3\n":
    Newline offset=0, "\n\n":
      SyntaxNode+NEWLINE0 offset=0, "\n":
        SyntaxNode offset=0, ""
        SyntaxNode offset=0, "\n"
      SyntaxNode+NEWLINE0 offset=1, "\n":
        SyntaxNode offset=1, ""
        SyntaxNode offset=1, "\n"
    LineNode block
    LineNode line1
    Newline offset=16, "\n":
      SyntaxNode+NEWLINE0 offset=16, "\n":
        SyntaxNode offset=16, ""
        SyntaxNode offset=16, "\n"
    LineNode line2
    LineNode nested
    LineNode line3


Clifford Heath

Jun 20, 2025, 5:01:30 AM
to treetop-dev@googlegroups.com Development
Your grammar asks for each char as a separate atom, so that is what you got.

Use *[^\n] instead.

Clifford Heath 


Clifford Heath

Jun 20, 2025, 5:03:43 AM
to treetop-dev@googlegroups.com Development
Ah sorry, I have been working with my prefix regular expressions too long.

Use [^\n]* instead.

Clifford Heath 
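
For reference, the two spellings accept the same text; what differs is how many atoms the expression is split into. The sketch below checks that equivalence in plain Ruby, using regular expressions as stand-ins for the two PEG expressions. That stand-in is an assumption made for illustration and is not Treetop's generated code; `PER_CHAR` and `CHAR_CLASS` are hypothetical names.

```ruby
# Plain-Ruby stand-ins for the two PEG expressions (illustration only).
# The /m flag lets "." match a newline, so that only the negative lookahead
# stops the per-character form at "\n".
PER_CHAR   = /\A(?:(?!\n).)*/m # mirrors (!"\n" .)*
CHAR_CLASS = /\A[^\n]*/        # mirrors [^\n]*

samples = ["  line1\n  line2\n", "\nrest", "no newline at all", ""]

samples.each do |text|
  per_char   = text[PER_CHAR]   # prefix consumed by the per-character form
  char_class = text[CHAR_CLASS] # prefix consumed by the character class
  verdict = per_char == char_class ? "same" : "different"
  puts "#{text.inspect}: #{per_char.inspect} vs #{char_class.inspect} (#{verdict})"
end
```

Both forms stop just before the first newline, so switching from one to the other should not change which inputs the rule accepts.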




Yves Dufournaud

Jun 20, 2025, 6:42:54 AM
to Treetop Development
Hi Clifford,

Thanks for your feedback.
I was about to ask where atoms are defined, but re-reading the doc I found:
https://github.com/cjheath/treetop?tab=readme-ov-file#terminal-symbols

_Terminals are called atomic expressions_

In fact, what I was looking for (I probably read too fast at some point) is this part:
_Treetop now also supports regular expressions as terminals._

I need it because I intend to have more complex patterns than [.]*.
I missed it at first because I didn't see any example using it. Yet in the same paragraph I found that
_regular expressions can be used as atoms_, and I can convert my grammar to use them with the ""r syntax, so:
([\t ]* "\n")+

becomes
"([\t ]*\n)+"r

Which is what I was trying to achieve in my example.
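
As a quick sanity check on what that pattern consumes, here is a minimal plain-Ruby sketch. It assumes the text between the quotes is handed to Ruby's regexp engine and matched only at the parser's current position; that is an assumption about Treetop's behaviour, and `StringScanner#scan` is used here to imitate it. `BLANK_LINES` and `blank_run` are hypothetical names.

```ruby
require 'strscan'

# Ruby equivalent of the pattern inside "([\t ]*\n)+"r.
BLANK_LINES = /([\t ]*\n)+/

# Hypothetical helper: returns the text the pattern consumes at the start of
# `input`, or nil when it does not match there. StringScanner#scan only
# matches at the current scan position, which stands in for the parser index.
def blank_run(input)
  StringScanner.new(input).scan(BLANK_LINES)
end

p blank_run("\n\nblock\n")      # the run of blank lines comes back as one string
p blank_run(" \t\n\n  line2\n") # whitespace-only lines are included in the run
p blank_run("block\n")          # no blank line at this position, so nil
```

The whole run of blank lines is returned as a single string rather than as one match per line.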
In fact, I thought I was already writing a regexp because I was using only terminals (I was assuming some kind of optimisation there),
and I was wrong.

Thanks again,

Regards,

Yves