Java Doc Parsing using ANTLR v4.5 for JAVA


gautam

Aug 30, 2016, 1:47:07 AM
to antlr-discussion
I'm trying to parse Javadoc comments for my Java application using an ANTLR grammar, but the problem is that my output contains escape characters and '*' characters, which is undesirable. I also tried TokenStreamRewriter, sending DOC_COMMENT to a different channel so I could modify the tokens, but didn't achieve anything. I need some help as I'm a newbie to ANTLR.

Grammar :

    DOC_COMMENT
        : DocComment
        ;

    fragment
    DocComment
        : '/**' .*? '*/'
        ;

Sample input :   /**
                  * generate ast for a java source file
                  *   
                  * @param sourceFileName
                  * @return
                  */

Sample Output : /**\r\n\t * generate ast for a java source file\r\n\t * \r\n\t * @param sourceFileName\r\n\t * @return\r\n\t */

Mike Cargal

Aug 30, 2016, 8:06:54 AM
to ANTLR List
If by "my output contains escape characters" you're referring to \r\n\t etc., that's just an artifact of how the string is displayed when it contains carriage returns, newlines, and tabs (debuggers and ANTLR's own token and tree dumps escape them so everything fits on one line). Your string doesn't really contain "\r"; it contains a single character with a value of 13 (aka a carriage return), which gets shown as "\r" since it has no other visual representation.
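
A quick way to convince yourself of this is a small sketch like the one below. The class and method names are made up for illustration; the escaping helper only mimics what escaped debug output tends to look like and is not ANTLR's code.

```java
final class EscapeDemo {
    // Render control characters the way escaped debug output usually shows them.
    static String escapeWhitespace(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t");
    }

    public static void main(String[] args) {
        String text = "/**\r\n\t * generate ast\r\n\t */";
        // The character right after "/**" is one char with the value 13, not a backslash and an 'r'.
        System.out.println((int) text.charAt(3));
        // Escaped, single-line view: this is the form that shows \r\n\t.
        System.out.println(escapeWhitespace(text));
        // The actual text, printed with real line breaks and tabs.
        System.out.println(text);
    }
}
```

The point is only that the escaped form is a display choice; the token text itself holds ordinary line breaks and tabs.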

Your grammar says that a DocComment is everything from '/**' to '*/' inclusive, so that includes all whitespace as well as the embedded *'s. If that's undesirable, it would be easy enough to remove the beginning /** and trailing */. You could also use a regex to identify and remove any instances of \r\n\s*\* from your string. This will remove the carriage return, linefeed, any whitespace before an * and the * itself, so you'll lose some information regarding the line breaks. If those are important to you, you can get more sophisticated and remove only the whitespace and * that follow a carriage return/linefeed, keeping the line break itself.
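
As a rough sketch of the "more sophisticated" variant, the Java below strips the delimiters and the leading * of each line but keeps the line breaks. The class name, the patterns, and the choice to also drop one space after each * are assumptions made for the example; adjust them to whatever formatting you want to preserve.

```java
import java.util.regex.Pattern;

final class DocCommentCleaner {
    // Opening "/**" plus any spaces or tabs right after it.
    private static final Pattern OPEN = Pattern.compile("^/\\*\\*[ \\t]*");
    // Closing "*/" plus the whitespace (including the line break) in front of it.
    private static final Pattern CLOSE = Pattern.compile("\\s*\\*/$");
    // A line break followed by indentation, a '*', and at most one space or tab.
    private static final Pattern LEADING_STAR = Pattern.compile("(\\r?\\n)[ \\t]*\\*[ \\t]?");

    static String clean(String docComment) {
        String s = CLOSE.matcher(docComment).replaceFirst("");
        s = OPEN.matcher(s).replaceFirst("");
        // Keep the captured line break ($1), drop the indentation and the '*'.
        s = LEADING_STAR.matcher(s).replaceAll("$1");
        return s.trim();
    }
}
```

You would pass it the text of the DOC_COMMENT token. A * that appears in the middle of a line is left alone because it is not preceded by a line break.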

You *could* try a parser rule for docComments, where you get more specific and assign labels to the parts you're interested in maintaining. As it is, you've defined DOC_COMMENT as a token, so it's all or nothing.

Using a parser rule would be something more like the following (not tested, so expect to tweak a few typos):


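DOC_COMMENT_START : '/**' ;
// END also swallows the line break and indentation in front of the final '*/',
// so that last line is not taken for a DOC_COMMENT_BREAK (longest match wins).
DOC_COMMENT_END   : ('\r'? '\n' [ \t]*)? '*/' ;
DOC_COMMENT_BREAK : '\r'? '\n' [ \t]* '*' ;
// Catch-all (an added assumption) so the remaining text becomes tokens for the wildcards.
ANY_CHAR          : . ;

docComment
    : DOC_COMMENT_START
      cmt+=.*?
      (DOC_COMMENT_BREAK cmt+=.*?)*
      DOC_COMMENT_END
    ;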

This is still going to leave carriage returns and line feeds that aren't followed by whitespace and an *, so modify accordingly if you want to exclude them as well. There are a number of ways to handle this depending upon how much of the comment formatting you want to preserve.

In general, I advise folks to avoid getting too clever in the grammar and just expect it to properly identify and classify the input stream. Then you can write code to handle the details, so there's something to be said for just sticking with a simple token that you know brings in the whole DOC_COMMENT, and then pulling the information you want out in your own code.
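
To make "pulling the information out in your own code" concrete, here is one possible sketch. It takes the full text of a DOC_COMMENT token (for example, what the token's getText() returns) and splits it into description lines and @tags. The class name, its fields, and the simplifications (no multi-line tag values, every non-tag line counts as description) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DocCommentInfo {
    final List<String> description = new ArrayList<>();
    final Map<String, List<String>> tags = new LinkedHashMap<>();

    // Assumes tokenText is a complete "/** ... */" comment.
    static DocCommentInfo parse(String tokenText) {
        DocCommentInfo info = new DocCommentInfo();
        // Drop the "/**" and "*/" delimiters.
        String body = tokenText.substring(3, tokenText.length() - 2);
        for (String line : body.split("\\r?\\n")) {
            // Remove indentation and the decorative leading '*', if there is one.
            String text = line.replaceFirst("^\\s*\\*?\\s*", "").trim();
            if (text.isEmpty()) {
                continue;
            }
            if (text.startsWith("@")) {
                // "@param sourceFileName" -> key "@param", value "sourceFileName"
                String[] parts = text.split("\\s+", 2);
                String value = parts.length > 1 ? parts[1] : "";
                info.tags.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(value);
            } else {
                info.description.add(text);
            }
        }
        return info;
    }
}
```

With this split, the grammar stays simple (one token for the whole comment) and the formatting decisions live in ordinary Java that is easy to test and change.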

Hope that helps more than it confuses… it’s a bit early in the morning and I may not have done the best job explaining.


ctsHydGTOIndia

Aug 31, 2016, 5:43:57 AM
to antlr-discussion
@Mike Cargal 

Thanks for the help....

As advised, I tried to pull the desired information out of DOC_COMMENT, but every time I get the whole DOC_COMMENT in my AST. I don't know how to do it.
Please help me.

Mike Cargal

Aug 31, 2016, 10:11:46 AM
to antlr-discussion
If you're still calling it DOC_COMMENT, then it's a lexer rule, and not a parser rule. A lexer rule results in a single token. Parser rules give you data structures with the parts of the parser rule.

If you’re not really clear on the distinction between the lexer and the parser, you *really* need to go back and read up on it.  You’ll have no end of pain if you don’t understand the difference.

In a nutshell: your input stream is handed to the lexer, which uses lexer rules (the rules whose names start with an uppercase letter). This creates a stream of tokens, which is then used to recognize parser rules (the rules whose names start with a lowercase letter). Of course, there's a bit more to it than that.
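
If it helps to see the two stages side by side, here is a hand-written toy in plain Java. It is not how ANTLR is implemented and none of these names come from ANTLR; it only illustrates the idea that the first stage turns characters into a flat list of typed tokens, and the second stage turns those tokens into structure.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoStageDemo {
    enum Type { START, END, STAR, NEWLINE, TEXT }

    static final class Tok {
        final Type type;
        final String text;
        Tok(Type type, String text) { this.type = type; this.text = text; }
    }

    // Stage 1, the "lexer": characters in, a flat list of typed tokens out.
    static List<Tok> lex(String in) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (in.startsWith("/**", i)) { out.add(new Tok(Type.START, "/**")); i += 3; }
            else if (in.startsWith("*/", i)) { out.add(new Tok(Type.END, "*/")); i += 2; }
            else if (c == '*') { out.add(new Tok(Type.STAR, "*")); i++; }
            else if (c == '\n') { out.add(new Tok(Type.NEWLINE, "\n")); i++; }
            else {
                // Everything up to the next '*' or line break is one TEXT token.
                int j = i + 1;
                while (j < in.length() && in.charAt(j) != '*' && in.charAt(j) != '\n') j++;
                out.add(new Tok(Type.TEXT, in.substring(i, j)));
                i = j;
            }
        }
        return out;
    }

    // Stage 2, the "parser": tokens in, structure out (one entry per non-empty comment line).
    static List<String> parse(List<Tok> tokens) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean atLineStart = false;
        for (Tok t : tokens) {
            if (t.type == Type.NEWLINE) {
                addLine(lines, current);
                atLineStart = true;
            } else if (t.type == Type.STAR) {
                if (!atLineStart) current.append('*'); // a '*' inside the text is kept
                atLineStart = false;
            } else if (t.type == Type.TEXT) {
                if (atLineStart && t.text.trim().isEmpty()) continue; // indentation before the '*'
                current.append(t.text);
                atLineStart = false;
            }
            // START and END only delimit the comment; they add nothing to the result.
        }
        addLine(lines, current);
        return lines;
    }

    private static void addLine(List<String> lines, StringBuilder current) {
        String text = current.toString().trim();
        if (!text.isEmpty()) lines.add(text);
        current.setLength(0);
    }
}
```

A single DOC_COMMENT lexer rule corresponds to stopping after stage 1 with one big token; the structured result only appears once something consumes smaller tokens and groups them.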
