Need help with an ANTLR4 grammar for parsing a custom commit message


Jeffry Angtoni

Jul 11, 2018, 2:38:19 AM
to antlr-di...@googlegroups.com

I am trying to create a custom commit message parser for a personal project. The header of the commit template is shown below.

# [<type>]: (If applied, this commit will...) <subject> (Max 50 char)
[FEATURE]:  Fix boot issue related to #354 inside ABC service [sytstemd-service].
Refactoring service.

Currently the grammar for this header is:

grammar StrictCommit ;
 
/*
 * Parser Rules
 */
commit
:
content EOF 
;
 
content
:
header
;
 
header
:
header_type header_type_border header_text
;
 
header_type
:
HEADER_TYPE_OPEN header_type_value HEADER_TYPE_CLOSE
;
 
header_type_value
:
HEADER_TYPE
;
 
header_type_border
:
HEADER_TYPE_BORDER
;
 
header_text
:
~[\r\n]+ (' ' | '\r'? '\n' | '\r' | header_text)*      // TODO Fix Rules
;
 
/*
 * Lexer Rules
 */ 
SKIP_TOKEN
:
(COMMENT | NEWLINE+) -> skip
;
 
HEADER_TYPE_OPEN
:
BRACKET_OPEN
;
 
HEADER_TYPE_CLOSE
:
BRACKET_CLOSE
;
 
HEADER_TYPE_BORDER
:
COLON WHITESPACES?
;
 
HEADER_TYPE
:
UPPERCASE+
;
 
WHITESPACES
:
WHITESPACE+
;
 
/*
 * Symbols
 */
//DOT                 : '.' ;
COMMA               : ',' ;
BRACKET_OPEN        : '[' ;
BRACKET_CLOSE       : ']' ;
COLON               : ':' ;
DOLLAR              : '$' ;
 
/*
 * Fragments
 */
fragment UPPERCASE
:
[A-Z]
;
 
fragment LOWERCASE
:
[a-z]
;
 
fragment DIGIT
:
[0-9]
;
 
fragment WHITESPACE
:
[ \t]
;
 
fragment NEWLINE
:
('\r'? '\n' | '\r')
;
 
fragment COMMENT
:
'#' ~[\r\n]*
;

I am still stuck on the header_text rule. I expect header_text to match only the header subject text, which is "Fix boot issue...Refactoring service.". However, ANTLR4 always matches the header_type rule together with the header_text rule. Is my grammar ambiguous, or should I change the expected input to something like this:

[FIX]: FGH Fix something new inside bootstrap version 4. Contact @bob for more information. FGH

FGH marks the start and end of the subject text, like a heredoc delimiter in bash. Any suggestions are appreciated.

Mike Cargal

Jul 12, 2018, 10:47:00 AM
to ANTLR List
(Brief mandatory mention that this could readily be done with a regex and capture groups, but I'm assuming you just want to use ANTLR.)

I'm not sure where the idea of Lexer Rules and Symbols as separate things comes from, but it's likely part of your problem.

Your "Symbols" are lexer rules (anything capitalized is a lexer rule).

Lexer rules can be built up from lexer fragments, but you shouldn't reference one lexer rule in another lexer rule (a bit surprised this doesn't generate an error, actually).

You're probably not getting the tokens you expect as a result (try running grun with the -tokens option).

For example, you probably want to skip defining HEADER_TYPE_OPEN and HEADER_TYPE_CLOSE and just use the rule:

header_type : BRACKET_OPEN header_type_value BRACKET_CLOSE ;

I'm not sure what else may be wrong, but that "jumps out".
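
To make the point about unexpected tokens concrete, here is a rough Python sketch (not ANTLR itself) that imitates how a lexer would pick tokens for the explicit lexer rules in the grammar above: the longest match wins, and on a tie the rule defined first wins. The names RULES and tokenize are made up for this illustration, the regexes are hand translations of the lexer rules, and the sketch ignores any implicit tokens created by literals in parser rules, so treat it as an approximation only.

```python
import re

# Hand-translated explicit lexer rules, in the order they appear in the grammar.
# (Illustrative approximation; implicit literal tokens are not modelled.)
RULES = [
    ("SKIP_TOKEN", r"#[^\r\n]*|(?:\r?\n|\r)+"),
    ("HEADER_TYPE_OPEN", r"\["),
    ("HEADER_TYPE_CLOSE", r"\]"),
    ("HEADER_TYPE_BORDER", r":[ \t]*"),
    ("HEADER_TYPE", r"[A-Z]+"),
    ("WHITESPACES", r"[ \t]+"),
    ("COMMA", r","),
    ("BRACKET_OPEN", r"\["),
    ("BRACKET_CLOSE", r"\]"),
    ("COLON", r":"),
    ("DOLLAR", r"\$"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Return (rule_name, text) pairs using longest match, first rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            # No rule matches this character: record it and move on.
            tokens.append(("ERROR", text[pos]))
            pos += 1
            continue
        if best_name != "SKIP_TOKEN":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for token in tokenize("[FEATURE]: Fix"):
        print(token)
```

In this sketch BRACKET_OPEN and BRACKET_CLOSE are never produced, because HEADER_TYPE_OPEN and HEADER_TYPE_CLOSE match the same text and are defined earlier. The "F" of "Fix" comes out as a HEADER_TYPE token, and the lowercase letters match no rule at all, which suggests why header_text cannot work without a lexer rule that covers ordinary subject text.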



Jeffry Angtoni

Jul 13, 2018, 12:48:10 AM
to antlr-discussion
Yes, I think parsing the commit message with ANTLR would take more effort. I have now found a solution that parses the commit message with a regex instead. I think it is less effort, and it only needs some optimization of the regex so that matching does not take too many steps.
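
The regex used in the thread was not posted, so the following is only a minimal Python sketch of the capture-group approach mentioned above. It assumes that template lines starting with "#" are comments, that the type is uppercase letters inside square brackets, and that the subject is everything after the colon with line breaks collapsed into single spaces. The names HEADER and parse_header are hypothetical.

```python
import re

# Assumed header shape: [TYPE]: subject text (subject may span several lines).
HEADER = re.compile(r"\[(?P<type>[A-Z]+)\]:[ \t]*(?P<subject>.+)", re.DOTALL)


def parse_header(message):
    """Return (type, subject) for a commit message, or None if it does not match."""
    # Drop template comment lines, i.e. lines that start with '#'.
    kept = [line for line in message.splitlines() if not line.startswith("#")]
    match = HEADER.match("\n".join(kept).strip())
    if match is None:
        return None
    # Collapse line breaks and repeated spaces inside the subject.
    subject = " ".join(match.group("subject").split())
    return match.group("type"), subject


if __name__ == "__main__":
    sample = (
        "# [<type>]: (If applied, this commit will...) <subject> (Max 50 char)\n"
        "[FEATURE]:  Fix boot issue related to #354 inside ABC service [sytstemd-service].\n"
        "Refactoring service.\n"
    )
    print(parse_header(sample))
```

Only lines that begin with "#" are treated as comments, so a reference such as "#354" in the middle of the subject is kept. A length check against the 50 character limit from the template could be added on the returned subject if that limit is meant to be enforced.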