Parsing crossed tags

Sylario Syl

Jul 2, 2012, 5:31:48 AM

to pe...@googlegroups.com

Is it possible to keep state in a variable across a parse? I have a presentation language (colors and formatting) where I can have this kind of configuration:

<color> texttex<italic>ttext</color>textt</italic>ext (not the actual syntax, but you get the idea).

And I'd like to create the following output:

 texttexttexttextt</italic>ext

Is it possible to parse this kind of syntax with peg.js?

Shinya Hayakawa

Jul 2, 2012, 9:25:30 AM

to pe...@googlegroups.com

Hi,
The following PEG might be what you want.
But I wouldn't recommend this. Properly nested input such as
"<color> texttex<italic>ttext</italic></color><italic>textt</italic>ext " is better.

{ var TAGS = []; }

start = tt:tagedtext* { return tt.join(''); }

tagedtext = "<" t:tag ">" { TAGS.push(t); return ''; } /
  "</" t:tag ">" { var i = TAGS.indexOf(t); if (0 <= i) TAGS.splice(i, 1); return ''; } /
  txt:text { return 0 < TAGS.length ? ""+txt+"" : txt; }

tag = "color" / "italic"

text = txt:[^<]+ { return txt.join(''); }
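
For comparison, here is a rough plain-JavaScript sketch of the same idea as the grammar above: scan the input, keep the currently open tags in an array, and wrap any text seen while at least one tag is open. The names `renderCrossed` and `wrap` are made up for this illustration, and passing the open-tag list to `wrap` is an addition that the grammar does not have. Treat it as a sketch, not a tested replacement.

```javascript
// Hypothetical helper: mirrors the TAGS array used by the grammar above.
// `wrap(text, openTags)` is an assumed callback that decides how text
// inside open tags is rendered.
function renderCrossed(input, wrap) {
  // Sticky regex: each match must start exactly at lastIndex.
  const tokenRe = /<(\/?)(color|italic)>|([^<]+)/y;
  const open = []; // tags opened and not yet closed, in opening order
  let out = "";
  tokenRe.lastIndex = 0;
  while (tokenRe.lastIndex < input.length) {
    const pos = tokenRe.lastIndex; // saved because a failed exec resets it
    const m = tokenRe.exec(input);
    if (m === null) {
      // Same strictness as the grammar: unknown "<...>" is an error.
      throw new SyntaxError("Unexpected input at offset " + pos);
    }
    if (m[3] !== undefined) {
      // Text: wrap it only while some tag is open.
      out += open.length > 0 ? wrap(m[3], open.slice()) : m[3];
    } else if (m[1] === "/") {
      // Closing tag: remove it wherever it sits, so crossed tags are fine.
      const i = open.indexOf(m[2]);
      if (i >= 0) open.splice(i, 1);
    } else {
      open.push(m[2]);
    }
  }
  return out;
}

// Example wrapper that just makes the open tags visible.
const show = (text, tags) => "[" + tags.join("+") + ":" + text + "]";
console.log(renderCrossed("<color> texttex<italic>ttext</color>textt</italic>ext", show));
```

Because closing removes the tag by name rather than by position, the order in which tags are closed does not matter, which is exactly what makes crossed tags acceptable here.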

Steve Ross

Jul 2, 2012, 2:29:53 PM

to pe...@googlegroups.com

I'm working on a similar problem and this is the first thread I've seen discussing it. Where I wound up going with my grammar was to match each piece and construct an AST out of plain objects. So, I'd have a tree like:

1 type: open_tag, value: 'color'
2 type: text, value: 'textex'
2 type: open_tag, value: 'italic'
2 type: text, value: 'ttext'
1 type: close_tag, value: 'italic'
1 type: open_tag, value: 'italic'
2 type: text, value: 'ttext'
1 type: close_tag, value: 'italic'
0 type: text, value: 'ext " is better'

With that syntax tree, I'm able to create an emitter from the AST instead of from the parser. I don't know if my approach is a good one or a bonehead one, but it relies on writing each nonterminal rule like this:

document
  = (text / tagged_text)*

tagged_text
  = open_tag text close_tag

open_tag
  = "<" t:[a-zA-Z0-9]+ ( /)? ">" { return { type: 'open_tag', value: t.join('') }; }

close_tag
  = "</" t:[a-zA-Z0-9]+ ">" { return { type: 'close_tag', value: t.join('') }; }

Obviously, this doesn't handle matching begin and end tags, so the grammar I've spelled out only works if the tags are sensibly nested. Still, you could emit stuff from that tree, I guess.
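
To make "emit stuff from that tree" concrete, here is a small JavaScript sketch of an emitter that walks a flat list of open_tag / close_tag / text objects and writes properly nested output even when the input tags cross. The token shape follows the listing above, but the function name `emitNested` and the policy (close inner tags early, then reopen them) are assumptions made for illustration, not something PEG.js provides.

```javascript
// Hypothetical emitter: turns crossed tags into properly nested ones.
// Tokens look like { type: 'open_tag' | 'close_tag' | 'text', value: string }.
function emitNested(tokens) {
  const stack = []; // names of the tags currently open, outermost first
  let out = "";
  for (const tok of tokens) {
    if (tok.type === "text") {
      out += tok.value;
    } else if (tok.type === "open_tag") {
      stack.push(tok.value);
      out += "<" + tok.value + ">";
    } else if (tok.type === "close_tag") {
      const i = stack.lastIndexOf(tok.value);
      if (i < 0) continue; // stray close tag: drop it
      // Tags opened after the one being closed must be closed first...
      const inner = stack.splice(i + 1);
      for (let j = inner.length - 1; j >= 0; j--) out += "</" + inner[j] + ">";
      out += "</" + tok.value + ">";
      stack.pop(); // the closed tag is now on top of the stack
      // ...and reopened afterwards so the following text keeps its formatting.
      for (const name of inner) {
        stack.push(name);
        out += "<" + name + ">";
      }
    }
  }
  // Close anything the input left open.
  while (stack.length > 0) out += "</" + stack.pop() + ">";
  return out;
}

const crossed = [
  { type: "open_tag", value: "color" },
  { type: "text", value: " texttex" },
  { type: "open_tag", value: "italic" },
  { type: "text", value: "ttext" },
  { type: "close_tag", value: "color" },
  { type: "text", value: "textt" },
  { type: "close_tag", value: "italic" },
  { type: "text", value: "ext" },
];
console.log(emitNested(crossed));
```

For the crossed example from the first message this produces the nested form suggested earlier in the thread, with italic closed before color and then reopened.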

I'm trying to present an alternative solution while at the same time asking whether mine makes any sense. Also, I've noticed that with PEG.js there is no such thing as a partially OK parse: the syntax matching is strict. Either the input text matches the grammar, or no tree is produced and an exception is thrown. This is not how browsers work -- they try to do something sensible (or at least they try not to break horribly) on a syntax error.

Hope this helps and sorry for jumping in with my own junk here.

Steve

David Majda

Jul 29, 2012, 10:39:52 AM

to syl...@gmail.com, pe...@googlegroups.com

2012/7/2 Sylario Syl <syl...@gmail.com>:

I wouldn't recommend using PEG.js to parse input like this. If I were
to write such a parser, I would make it work in two steps:

1. A lexical analyzer that would just produce a list of tokens
("start tag", "attribute", "end tag", "text", etc.).

2. A tree builder that would build an AST from the tokens. This
builder would contain some heuristic rules specifying how to deal with
mismatched end tags.

The first part -- the lexical analyzer -- could easily be done in
PEG.js, but it would probably be overkill.
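
The two steps above could be sketched in plain JavaScript roughly as follows. This is only an illustration under assumptions: the token and node shapes, the function names `tokenize` and `buildTree`, and the two heuristics (ignore an end tag that matches nothing; when an end tag crosses other open elements, close them and reopen copies) are all choices made for the example, not the only reasonable ones. Attributes are left out.

```javascript
// Step 1: lexical analyzer. Produces start_tag / end_tag / text tokens.
// A lone "<" that does not start a tag is kept as text instead of failing.
function tokenize(input) {
  const tokens = [];
  const re = /<(\/?)([a-zA-Z0-9]+)>|([^<]+|<)/g;
  let m;
  while ((m = re.exec(input)) !== null) {
    if (m[3] !== undefined) {
      tokens.push({ type: "text", value: m[3] });
    } else {
      tokens.push({ type: m[1] ? "end_tag" : "start_tag", name: m[2] });
    }
  }
  return tokens;
}

// Step 2: tree builder with heuristics for mismatched end tags.
function buildTree(tokens) {
  const root = { type: "root", children: [] };
  const stack = [root]; // path from the root to the innermost open element
  for (const tok of tokens) {
    const top = stack[stack.length - 1];
    if (tok.type === "text") {
      top.children.push({ type: "text", value: tok.value });
    } else if (tok.type === "start_tag") {
      const node = { type: "element", name: tok.name, children: [] };
      top.children.push(node);
      stack.push(node);
    } else {
      // Find the nearest open element with the same name.
      let i = stack.length - 1;
      while (i > 0 && stack[i].name !== tok.name) i--;
      if (i === 0) continue; // heuristic 1: ignore an unmatched end tag
      // Heuristic 2: elements opened inside the closed one are cut short
      // and reopened as fresh copies under the new innermost element.
      const crossed = stack.splice(i).slice(1);
      for (const el of crossed) {
        const copy = { type: "element", name: el.name, children: [] };
        stack[stack.length - 1].children.push(copy);
        stack.push(copy);
      }
    }
  }
  return root;
}

const tree = buildTree(tokenize("<color>a<italic>b</color>c</italic>d"));
console.log(JSON.stringify(tree, null, 2));
```

Keeping the lexer this dumb means all the policy for broken input lives in one place, the tree builder, where it can be changed without touching the tokenizing step.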

--
David Majda
Entropy fighter
http://majda.cz/
