Correct way to parse many Java source files and extract all of the method tokens

Ryan Warren

Jan 16, 2021, 12:09:45 PM
to antlr-discussion
Hello,

I am new to using ANTLR and wanted to know what would be the best way to lex and parse an entire Java project (~50K source files) to extract all their method declarations?

Will I need to create a new Java8Lexer, CommonTokenStream and Java8Parser for each source file to get the CompilationUnit and then visit all the MethodDeclarations?

A crude test of this idea takes a long time, and I assume this is because a new instance of the Lexer and Parser has to be created each time. Would anybody be able to advise me on a better solution to this problem?
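
For reference, the per-file loop described above might be sketched roughly as follows. This is only the shape of the loop under stated assumptions: `FileParser` and `ProjectScan` are hypothetical names, not ANTLR classes, and the real parse step (building a Java8Lexer, CommonTokenStream and Java8Parser for one file) would live behind `methodsOf`. Since each file is handled independently, the sketch processes files in parallel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical per-file parse step: takes source text, returns what was found in it. */
interface FileParser {
    // An implementation is assumed to build its own lexer and parser on every call,
    // so that no mutable parser state is shared between threads.
    List<String> methodsOf(String source);
}

class ProjectScan {
    /** Collects all regular files ending in .java below root. */
    static List<Path> javaFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.toString().endsWith(".java"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Runs the parse step once per file and gathers the results by path. */
    static Map<Path, List<String>> scan(Path root, FileParser parser) throws IOException {
        Map<Path, List<String>> result = new ConcurrentHashMap<>();
        javaFiles(root).parallelStream().forEach(file -> {
            try {
                result.put(file, parser.methodsOf(Files.readString(file)));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return result;
    }
}
```

A call such as `ProjectScan.scan(Path.of("src"), myParser)` would then return one entry per source file, where `myParser` is whatever implementation wraps the actual lexer and parser.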

Best wishes,
Ryan

Federico Tomassetti

Jan 16, 2021, 6:09:01 PM
to antlr-di...@googlegroups.com
Hi,
For this specific goal I would consider JavaParser, even though it is not based on ANTLR. It can easily do what you need.

Cheers,
Federico 

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/72476945-1762-4caf-9d25-a0d585e55bb1n%40googlegroups.com.
--

Ryan Warren

Jan 16, 2021, 6:22:04 PM
to antlr-discussion
Hi Federico,

Thank you for your reply. I am currently using JavaParser and have found that it works brilliantly for everything I have needed so far. Unfortunately, I am now trying to support other languages, so I was looking for something that would allow this (from my understanding JavaParser can only parse Java code, but please correct me if I am wrong). Essentially, my end goal is to collect the tokens used in a method declaration (both the header and the body), and I would like to be able to do the same for several other languages such as Python, JavaScript, and C.

Would you recommend using JavaParser to extract the method tokens from Java files, and alternative parsers (ideally written in Java) to extract the method tokens for the other languages?

Cheers,
Ryan

Federico Tomassetti

Jan 18, 2021, 4:46:17 AM
to antlr-di...@googlegroups.com
Hi Ryan,
I think that, in the specific case of Java, not having to write such a
complex grammar and evolve it over time is a great benefit, as Java
has now started to change every other day. For this reason I would
consider "outsourcing" the maintenance of the grammar to an OSS
project, unless you have the energy to keep up with Java's
evolution. In the context of building a polyglot analysis suite I
would consider wrapping JavaParser and writing parsers for the other
languages myself (or collecting them and then building adapters in front
of them). This is how I would do it, but I think it really depends
on your goals and the resources you can dedicate to this.
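
The adapter idea could be sketched along these lines. All names here (`MethodTokenExtractor`, `ExtractorRegistry`) are made up for illustration; a JavaParser-backed implementation for Java, or some other parser for Python or C, would sit behind the same interface. It is a minimal sketch of the design, not a recommendation of specific libraries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical common interface that each language-specific adapter implements. */
interface MethodTokenExtractor {
    /** Returns one token list per method found in the given source text. */
    List<List<String>> extract(String source);
}

/** Chooses an adapter based on the file extension. */
class ExtractorRegistry {
    private final Map<String, MethodTokenExtractor> byExtension = new HashMap<>();

    /** Registers an adapter for an extension given without the dot, e.g. "java". */
    void register(String extension, MethodTokenExtractor extractor) {
        byExtension.put(extension, extractor);
    }

    /** Looks up the adapter for a file name; empty if the language is not supported. */
    Optional<MethodTokenExtractor> forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        return Optional.ofNullable(byExtension.get(fileName.substring(dot + 1)));
    }
}
```

The rest of the analysis code would then depend only on `MethodTokenExtractor`, so swapping the parser behind one language does not affect the others.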

Cheers,
Federico

Ryan Warren

Jan 18, 2021, 5:24:51 AM
to antlr-di...@googlegroups.com
Hi Federico,

Thank you for the advice! That does make sense; providing a wrapper for JP was what I was considering. Creating an adapter for each outsourced parser also seems like a good option for an initial proof of concept. I will have a look around (unless you have any recommendations?) to see if I can find a JP equivalent for Python, again mainly as a proof of concept first.

Cheers,
Ryan


iris xu

Jan 23, 2021, 4:40:42 AM
to antlr-discussion
Hello, I have nearly the same concern. I need to parse and merge multiple ASTs whose grammars happen to be written for ANTLR. However, I found that the contexts become invalid after the corresponding parsers are closed. Isn't keeping all the lexers, token streams and parsers alive too heavy?

Martin Mirchev

Jan 23, 2021, 5:22:52 AM
to 'edgar hoover' via antlr-discussion
Hi, I think you need to set the property BuildParseTrees to true to keep them alive afterwards.


Mike Lischke

Jan 23, 2021, 5:35:45 AM
to 'rtm...@googlemail.com' via antlr-discussion

> Hi, I think you need to set the property BuildParseTrees to true to keep them alive afterwards.

This setting controls whether parse trees are generated, not how long they persist. You can keep parse trees as long as you want, provided you also keep the token stream intact, because the parse tree contains references to tokens.

If the setting is true, a new parse tree is generated on each parse run.
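
To picture why the tokens have to outlive the tree, here is a toy model in plain Java. The classes (`SpanToken`, `MethodNode`, `ParsedUnit`) are stand-ins invented for this sketch, not ANTLR types: the token stores offsets into the input rather than a copy of the text, and the node refers to tokens by index. In Java the garbage collector keeps everything reachable, so the point is mainly the dependency chain (tree, then tokens, then input); in a runtime with manual memory management the same chain has to be kept alive by hand, for example by storing the pieces together in one owner object.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a token that stores offsets into the input, not a copy of the text. */
final class SpanToken {
    private final CharSequence input; // the token is only readable while this is valid
    private final int start;          // first character, inclusive
    private final int stop;           // last character, inclusive

    SpanToken(CharSequence input, int start, int stop) {
        this.input = input;
        this.start = start;
        this.stop = stop;
    }

    String text() {
        return input.subSequence(start, stop + 1).toString();
    }
}

/** Stand-in for a parse-tree node that refers to its tokens by index. */
final class MethodNode {
    final int firstToken; // index of the first token, inclusive
    final int lastToken;  // index of the last token, inclusive

    MethodNode(int firstToken, int lastToken) {
        this.firstToken = firstToken;
        this.lastToken = lastToken;
    }
}

/** Owner object that keeps the tokens next to the nodes that depend on them. */
final class ParsedUnit {
    private final List<SpanToken> tokens;

    ParsedUnit(List<SpanToken> tokens) {
        this.tokens = tokens;
    }

    /** Returns the text of every token covered by the node (header and body). */
    List<String> tokensOf(MethodNode node) {
        List<String> out = new ArrayList<>();
        for (int i = node.firstToken; i <= node.lastToken; i++) {
            out.add(tokens.get(i).text());
        }
        return out;
    }
}
```

With this layout, a node on its own is useless: `tokensOf` needs the token list, and each token needs the input text.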

iris xu

Jan 23, 2021, 5:37:47 AM
to antlr-discussion
I'm sorry, but...
```
    // Build the usual pipeline: input stream -> lexer -> token stream -> parser.
    ANTLRInputStream *p_input = new ANTLRInputStream(s);
    SqlBaseLexer *p_lexer = new SqlBaseLexer(p_input);
    CommonTokenStream *p_tokens = new CommonTokenStream(p_lexer);
    p_tokens->fill();
    for (Token *token : p_tokens->getTokens()) {
        std::cout << token->toString() << std::endl;
    }
    SqlBaseParser *p_parser = new SqlBaseParser(p_tokens);
    p_parser->setBuildParseTree(true);
    auto tree = p_parser->singleStatement();
    puts("s0");
    puts(parseTree2Str(tree).c_str());
    // Delete the objects one at a time and print the tree after each step
    // to see which deletion invalidates it.
    delete p_lexer;
    puts("s2");
    puts(parseTree2Str(tree).c_str());
    delete p_parser;
    // Observed: the run always stops at s4.
    puts("s4");
    puts(parseTree2Str(tree).c_str());
    delete p_tokens;
    puts("s3");
    puts(parseTree2Str(tree).c_str());
    delete p_input;
    puts("s1");
    puts(parseTree2Str(tree).c_str());
    return tree;
```
It always stops at s4, which suggests that neither the CommonTokenStream nor the ANTLRInputStream should be closed.