Modify XML Lexer grammar to separate opening/closing tags from their contents


Alex Spurling

Jun 19, 2018, 6:30:44 AM

to antlr-discussion

I am currently using the XML lexer grammar defined here:

https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4

With the given input, I get the following lexer events generated:

XML input:

<item>

<![CDATA[

My CDATA Block

]]>

</item>

Lexer output:

[@-1,0:0='<',<7>,1:0]

[@-1,1:4='item',<16>,1:1]

[@-1,5:5='>',<10>,1:5]

[@-1,6:8='\n ',<6>,1:6]

[@-1,9:42='<![CDATA[\n My CDATA Block\n ]]>',<2>,2:2]

[@-1,43:43='\n',<6>,4:5]

[@-1,44:44='<',<7>,5:0]

[@-1,45:45='/',<13>,5:1]

[@-1,46:49='item',<16>,5:2]

[@-1,50:50='>',<10>,5:6]

However, I would like to separate the '<![CDATA[' start tag and the ']]>' end tag from the CDATA event, so that I get one event that contains only the contents of the block. I have tried the following grammar, which almost works, except that because the rule for the CDATA contents uses a non-greedy match, every single character in the block is emitted as its own event:

New grammar:

https://gist.github.com/alexspurling/2e243b1c806a4482697700ea1f686d44

(Diff: https://gist.github.com/alexspurling/2e243b1c806a4482697700ea1f686d44/revisions)

Output:

[@-1,0:0='<',<6>,1:0]

[@-1,1:4='item',<15>,1:1]

[@-1,5:5='>',<9>,1:5]

[@-1,6:8='\n ',<5>,1:6]

[@-1,9:17='<![CDATA[',<2>,2:2]

[@-1,18:18='\n',<19>,2:11]

[@-1,19:19=' ',<19>,3:0]

[@-1,20:20=' ',<19>,3:1]

[@-1,21:21=' ',<19>,3:2]

[@-1,22:22=' ',<19>,3:3]

[@-1,23:23='M',<19>,3:4]

[@-1,24:24='y',<19>,3:5]

[@-1,25:25=' ',<19>,3:6]

[@-1,26:26='C',<19>,3:7]

[@-1,27:27='D',<19>,3:8]

[@-1,28:28='A',<19>,3:9]

[@-1,29:29='T',<19>,3:10]

[@-1,30:30='A',<19>,3:11]

[@-1,31:31=' ',<19>,3:12]

[@-1,32:32='B',<19>,3:13]

[@-1,33:33='l',<19>,3:14]

[@-1,34:34='o',<19>,3:15]

[@-1,35:35='c',<19>,3:16]

[@-1,36:36='k',<19>,3:17]

[@-1,37:37='\n',<19>,3:18]

[@-1,38:38=' ',<19>,4:0]

[@-1,39:39=' ',<19>,4:1]

[@-1,40:42=']]>',<18>,4:2]

[@-1,43:43='\n',<5>,4:5]

[@-1,44:44='<',<6>,5:0]

[@-1,45:45='/',<12>,5:1]

[@-1,46:49='item',<15>,5:2]

[@-1,50:50='>',<9>,5:6]

My desired output would be:

[@-1,0:0='<',<7>,1:0]

[@-1,1:4='item',<16>,1:1]

[@-1,5:5='>',<10>,1:5]

[@-1,6:8='\n ',<6>,1:6]

[@-1,9:17='<![CDATA[',<2>,2:2]

[@-1,18:39='\n My CDATA Block\n ',<19>,2:11]

[@-1,40:42=']]>',<18>,4:2]

[@-1,43:43='\n',<6>,4:5]

[@-1,44:44='<',<7>,5:0]

[@-1,45:45='/',<13>,5:1]

[@-1,46:49='item',<16>,5:2]

[@-1,50:50='>',<10>,5:6]

How can I change the grammar to achieve this?
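
To make the goal concrete, here is an untested sketch of the direction I think is needed: a dedicated lexer mode whose content rule is greedy instead of non-greedy. The names CDATA_OPEN, CDATA_TEXT, CDATA_CLOSE, CDATA_BRACKET and IN_CDATA are placeholders of mine, not taken from the linked grammar, and I have not checked how CDATA_OPEN interacts with any other default-mode rule that also begins with '<!'.

```antlr
// Sketch only. In the default mode, this would replace the single CDATA rule.
// All rule and mode names below are placeholders.
CDATA_OPEN    : '<![CDATA[' -> pushMode(IN_CDATA) ;

// ... remaining default-mode rules and modes unchanged ...

mode IN_CDATA;

// ']]>' is three characters long, so under the longest-match rule it wins
// over the one-character CDATA_BRACKET fallback below.
CDATA_CLOSE   : ']]>' -> popMode ;

// Greedy run of every character except ']', emitted as a single token.
CDATA_TEXT    : ~[\]]+ ;

// A ']' that does not start ']]>' is still content. It is emitted as its own
// token, relabelled as CDATA_TEXT.
CDATA_BRACKET : ']' -> type(CDATA_TEXT) ;
```

If this is right, the sample input above should yield exactly one CDATA_TEXT event, because its contents contain no ']'. Contents that do contain ']' would be split into several CDATA_TEXT events, which is still far fewer than one per character.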

Thanks,

Alex
