[Discussion] [GSoC] Creating a Java Parser


Gajjar Smit

Mar 15, 2020, 10:16:57 PM
to sympy
Hello everyone,
I am Smit Gajjar, a third-year undergraduate student pursuing Computer Engineering at LD College of Engineering, Ahmedabad, India.

I have been exploring SymPy since last year by reading the codebase, trying out examples, and, more recently, solving issues. I look forward to contributing to SymPy through a GSoC project, if I am lucky enough. I have used Python and C++ for 3 years in contests and projects, and Java for 4 years in school and college practicals and mini-projects.

I am highly motivated to work on a Java parser that would convert Java code to the corresponding SymPy code. I read through the C and Fortran parsers in the parsing module (implemented by Nikhil Maan last year) to understand how they work.

I believe I can do the same for Java, since I am quite comfortable with the language. I am currently revising the parser and lexer concepts from our college courses on System Programming, Theory of Computation, and Compiler Design, and working through examples with the javalang pip package and antlr4, which would make this possible.

(Overview: the C parser converts C syntax to an AST using Clang (an optional external dependency), and the Fortran parser converts Fortran syntax to an ASR using lfortran (an optional external dependency). After that, SymPy's own codegen is responsible for generating SymPy syntax from the AST or ASR. The Java parser would work similarly.)

Lastly, my question is: is it okay to carry this forward? If so, is there any other documentation, code, or issue I should go through to better familiarize myself with the current status of the parsing module?

Aaron Meurer

Mar 15, 2020, 10:25:06 PM
to sympy
I think it fits, although honestly Java is less important than C or
Fortran. But there are some numerical uses of it.

How developed is the Javalang package? Will it be sufficient to
represent the AST or will you need to work there?

Aaron Meurer
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/408dfc2b-9780-4bd9-820a-00fdad6f9ccf%40googlegroups.com.

Nikhil Maan

Mar 16, 2020, 2:41:51 AM
to sympy
I tested the package on some basic syntax, and it seems to provide a parse tree representation of the source.
But does it provide any support for traversing the tree and accessing child nodes, or would you need to implement that yourself?

Also, you should check out the SymPyExpression class in sym_expr.py. I created it as a wrapper that acts as a front end for all the parsers, so that users don't have to learn each parser and its API separately. The parser you create, which will live in the parsing/java module, should be called from the SymPyExpression class; users shouldn't need to interact with the parser directly.

You should also check out ast.py to get familiar with SymPy's AST.
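
A minimal sketch of the front-end idea described above, assuming nothing about the actual code in sym_expr.py: the class name `ExpressionFrontEnd`, the `_PARSERS` registry, and the two stub parser functions are hypothetical names introduced only for illustration. The stubs return tagged strings where a real back end would return SymPy AST nodes.

```python
from typing import Callable, Dict, List


def _parse_c_stub(source: str) -> List[str]:
    # Hypothetical stand-in for a C back end; a real one would build AST nodes.
    return [f"c:{line.strip()}" for line in source.splitlines() if line.strip()]


def _parse_java_stub(source: str) -> List[str]:
    # Hypothetical stand-in for a Java back end.
    return [f"java:{line.strip()}" for line in source.splitlines() if line.strip()]


# Assumed layout: one registry entry per supported language mode.
_PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "c": _parse_c_stub,
    "java": _parse_java_stub,
}


class ExpressionFrontEnd:
    """Single entry point, so users never call a language parser directly."""

    def __init__(self, source: str, mode: str) -> None:
        try:
            parser = _PARSERS[mode.lower()]
        except KeyError:
            raise NotImplementedError(f"no parser registered for mode {mode!r}") from None
        self._expr = parser(source)

    def return_expr(self) -> List[str]:
        # Return a copy of the parsed statements, whichever back end produced them.
        return list(self._expr)


if __name__ == "__main__":
    front = ExpressionFrontEnd("int a = 1;\nint b = 2;", "java")
    print(front.return_expr())
```

With this shape, adding a language means registering one more entry; the user-facing class does not change.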

Gajjar Smit

Mar 16, 2020, 3:12:50 PM
to sympy
Thanks, Aaron and Nikhil, for the suggestions.

I have gone through the javalang package and tried out a few examples. AST nodes can be traversed in javalang, and their children can be accessed. Each relevant group of tokens is represented as an object within the top-level CompilationUnit (somewhat similar to TranslationUnit in clang), which is itself a subclass of the generic Node class. (Link to the javalang repo)
A basic example of tree traversal is attached as an image.
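
For readers without the attached image, here is a rough sketch of the kind of traversal meant above. It does not use javalang itself: the `Node` class, the `walk` and `filter_kind` helpers, and the kind strings are hypothetical stand-ins chosen to illustrate a depth-first walk that reports each node together with the path of its ancestors.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a parser's AST node (not javalang's class)."""
    kind: str
    children: List["Node"] = field(default_factory=list)


Path = Tuple[Node, ...]


def walk(node: Node, path: Path = ()) -> Iterator[Tuple[Path, Node]]:
    """Yield (ancestors, node) pairs in depth-first pre-order."""
    yield path, node
    for child in node.children:
        yield from walk(child, path + (node,))


def filter_kind(root: Node, kind: str) -> List[Node]:
    """Collect every node of the given kind anywhere below root."""
    return [n for _, n in walk(root) if n.kind == kind]


if __name__ == "__main__":
    tree = Node("CompilationUnit", [
        Node("ClassDeclaration", [
            Node("FieldDeclaration"),
            Node("MethodDeclaration", [Node("LocalVariableDeclaration")]),
        ]),
    ])
    for ancestors, node in walk(tree):
        # Indent by depth so the tree shape is visible.
        print("  " * len(ancestors) + node.kind)
```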


I went through the AST hierarchy in SymPy's ast module. I am also trying to understand the generic sym_expr module.
Should I start working on my proposal in this direction?
javalang1.png

Aaron Meurer

Mar 16, 2020, 3:16:16 PM
to sympy
Another thing to consider, since this would be the third such language
to be supported in SymPy (after C and Fortran), is if there are
commonalities in the parsing code for each that should be factored out
into a helper submodule.
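
One possible shape for such a helper, as a sketch only: a shared base class owns the dispatch-by-node-kind logic and the common handlers, while each language supplies just its own tables. All names here (`BaseTransformer`, `CTransformer`, `JavaTransformer`, `TYPE_MAP`, the dict-based node format) are hypothetical and are not taken from the existing parsers.

```python
from typing import Any, Dict, Tuple


class BaseTransformer:
    """Hypothetical shared helper: routes a node to transform_<kind>."""

    # Each language overrides this with its own source-type -> target-type table.
    TYPE_MAP: Dict[str, str] = {}

    def transform(self, node: Dict[str, Any]) -> Any:
        handler = getattr(self, "transform_" + node["kind"].lower(), None)
        if handler is None:
            raise NotImplementedError(f"unsupported node kind: {node['kind']}")
        return handler(node)

    def transform_var_decl(self, node: Dict[str, Any]) -> Tuple[str, str, str]:
        # Common handler: only the type table differs between languages.
        try:
            target_type = self.TYPE_MAP[node["type"]]
        except KeyError:
            raise NotImplementedError(f"unsupported type: {node['type']}") from None
        return ("Declaration", node["name"], target_type)


class CTransformer(BaseTransformer):
    TYPE_MAP = {"int": "integer", "float": "real", "double": "real"}


class JavaTransformer(BaseTransformer):
    TYPE_MAP = {"int": "integer", "long": "integer",
                "double": "real", "boolean": "bool"}


if __name__ == "__main__":
    decl = {"kind": "VAR_DECL", "name": "x", "type": "int"}
    print(CTransformer().transform(decl))
    print(JavaTransformer().transform(decl))
```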

Aaron Meurer

Gajjar Smit

Mar 16, 2020, 3:30:44 PM
to sy...@googlegroups.com
I will certainly look into the commonalities between those modules, if any exist, and will raise a relevant issue!

Gajjar Smit

Mar 22, 2020, 6:37:32 PM
to sympy
Here is the docs link to my GSoC Proposal: https://docs.google.com/document/d/1HFWvBV-NjQVd2Tvv4-_tlNrGG52WWxJUlUlByLxfOog/edit?usp=sharing

Please go through it. Suggestions of any form are most welcome!

A few queries:
  1. Who will be the potential mentor if this project gets selected?
  2. Do let me know whether the "Overview of Milestones" part after each phase, which shows the high-level implementation of the code, should have more details or is sufficient as it is.
Thanks


Nikhil Maan

Mar 23, 2020, 11:12:03 AM
to sympy
I looked through the proposal and have some suggestions. Can you give comment access to people with the link? I think it'd be better to comment directly on the proposal.
As for your queries:
1) I will be mentoring for the project.
2) No, you do not need to provide more details in the milestone overviews, but you can use something like "implement variable declarations" instead of transform_variable_declaration(). I think a little more can be done each month than what is currently proposed.

If you are familiar with C and C++ syntax (and, seeing that you have a few PRs improving the current parsers, you are familiar with those too), can you also improve the current C parser during this GSoC period? For example, you could implement loops and other features that are not yet implemented in the C parser but that you are proposing for the Java parser.

Regards,
Nikhil Maan


Gajjar Smit

Mar 23, 2020, 11:33:34 AM
to sy...@googlegroups.com
Thanks for the feedback, Nikhil!
I did not realize that I had disabled comment access. I have tried to fix that; please check whether it is accessible now.

Also, I can include improving the C parser, if that fits in the timeline, since I am now quite familiar with the C parser code. Currently, the while statement, for statement, break, continue, and if statement are not implemented in the C parser. Initially, I thought it might not be possible to complete this over the summer, but I can definitely include it in the proposal now, along with the Java parser.



Gajjar Smit

Mar 23, 2020, 5:29:53 PM
to sy...@googlegroups.com
Hi, I need a suggestion regarding the timeline.

Should I distribute the remaining implementation of the C parser (implementing loops and other functions) across all 3 phases, or should I complete it in the first phase and then proceed to the Java parser? Please suggest any other approach if it is better!

Aaron Meurer

Mar 24, 2020, 11:53:18 AM
to sympy
I think either way is fine. If you suspect things will come up in the
C parser that you won't notice until you have done work in the Java
parser, then it makes sense to split it. Otherwise, it is cleanest to
do it all at once.

Aaron Meurer

Gajjar Smit

Mar 24, 2020, 6:26:45 PM
to sympy
Agreed. I will first finish most of the C parser work (since it has higher priority) before digging into the Java parser; whatever is left in C can then be done alongside Java.

Gajjar Smit

Mar 26, 2020, 7:22:38 AM
to sy...@googlegroups.com
I have a serious query regarding the use of Clang in the C parser for further implementation work, such as assignments and loops.

I was trying to implement conversion of the if statement in the C parser. After traversal, what I got from the AST was IF_STMT, and its child node was a BINARY_OPERATOR. Naturally, I wanted to know which binary operator it was. Unfortunately, I found that the kind of binary operator (e.g. =, <=, +, -, /, *) cannot yet be obtained from the Clang Python bindings (i.e. clang.cindex), because that support is still under development. Later, I found that the kind of a unary operator cannot be obtained either.

For implementing loops too, we will need to parse the assignment operator (a binary operator) as well as increment/decrement (unary operators). The same applies to implementing aug_assign (the shorthand operators in C that perform an operation and assign in one step).
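
To make the operator categories above concrete, here is a small illustrative lookup, assuming the operator spelling has already been recovered as a string. The set names and the two helper functions are hypothetical; only the C operator spellings themselves are standard, and the sets cover just the operators discussed here rather than every C operator.

```python
from typing import Optional

ASSIGNMENT = {"="}
AUGMENTED_ASSIGNMENT = {"+=", "-=", "*=", "/=", "%="}
COMPARISON = {"<", "<=", ">", ">=", "==", "!="}
ARITHMETIC = {"+", "-", "*", "/", "%"}
INCREMENT_DECREMENT = {"++", "--"}  # the unary operators needed for loops


def classify_operator(op: str) -> str:
    """Return the category of a C operator spelling."""
    if op in ASSIGNMENT:
        return "assignment"
    if op in AUGMENTED_ASSIGNMENT:
        return "augmented_assignment"
    if op in COMPARISON:
        return "comparison"
    if op in ARITHMETIC:
        return "arithmetic"
    if op in INCREMENT_DECREMENT:
        return "increment_decrement"
    raise ValueError(f"unrecognised operator: {op!r}")


def underlying_operator(op: str) -> Optional[str]:
    """For an augmented assignment such as '+=', return its arithmetic part."""
    if op in AUGMENTED_ASSIGNMENT:
        return op[:-1]  # drop the trailing '='
    return None


if __name__ == "__main__":
    for op in ["=", "+=", "<=", "*", "++"]:
        print(op, classify_operator(op), underlying_operator(op))
```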

The conclusion is that we either have to wait for a Clang release that contains these improvements or find an alternative.

Please suggest what to do in this context as soon as possible! One option is to implement it from scratch using pycparser, which is very well developed. If you agree with this, I will start making PRs (which use pycparser) soon and work on the C parser in this way until GSoC (I am available now because of the current situation in the country).

The other option is to change clang.cindex and some C++ files in Clang, as advised by some people in the Clang community, but that doesn't seem feasible from a user's point of view.

Also, I have updated my proposal according to the suggested changes. Please review it and suggest appropriate changes. I have to update it soon since the deadline is near! (I will have to change Clang to pycparser if needed; I have already started learning it.)



Gajjar Smit

Mar 27, 2020, 2:20:44 PM
to sympy
I have successfully parsed variable declarations in C using pycparser and put the code in sympy/parsing/c2/c_parser2.py. I am making a PR! Please have a look. If you feel it is worth doing the same for other constructs, please let me know!

Nikhil Maan

Mar 27, 2020, 2:57:33 PM
to sympy
I just checked, pycparser only supports C99 and some features of C11 and I don't think it has any support for C++ syntax.

The python bindings for Clang might be less developed, but Clang is clearly more developed than most of the alternatives.

And as far as I've seen, the Clang AST supports these features and many others completely, just the bindings are lacking. So, I think it'll be better to stick to Clang for now.
Let me see what can be done for the python bindings.

Regards,
Nikhil Maan

Gajjar Smit

Mar 27, 2020, 3:37:59 PM
to sympy
Yes, as you said, it only supports C99 and (partially) C11, and it doesn't support C++ either. Please suggest whatever is necessary.

Also, please review the proposal whenever time permits (I have kept Clang in it for now), since the deadline is approaching.

Ondřej Čertík

Mar 27, 2020, 5:35:18 PM
to sympy
Hi Gajjar,

I read through your proposal. It looks good overall.

I would suggest spending more time on the C side than on Java; I think C would be very useful for a lot of people. I don't know how many Java users there are who would use the Java backend.

Also, if you are interested at all in improving the Fortran integration, let me know. I am looking for students again this year for that.


Ondrej


Gajjar Smit

Mar 28, 2020, 7:25:47 AM
to sy...@googlegroups.com
Hi Ondrej Sir,

Thank you for the suggestions. I will look into it and update it accordingly.

I think I can start the C parser implementation during the Community Bonding Period itself and start making PRs (there is a whole month, and I don't think it is justifiable to use it only for community bonding!). I have also already started working on a few implementations, but because of college assignments to be submitted on Google Classroom, I have less time these days.

Regarding Fortran integration, I went through the current implementation (fortran_parser), and I think I can fit its remaining implementation work into the last 2-2.5 weeks of Phase 3, since the code for the Java parser is expected to be 80-85 percent ready by then. If a few things are still left in the Fortran parser, I would love to continue after GSoC too!

However, I first need approval from you and from my current potential mentor, Nikhil, in this regard, since I am participating for the first time and might have underestimated the time needed to complete the milestones.

Also, do note that I have never worked with Fortran, so I will have to spend almost a week on that! (I have worked with many programming languages, so it won't take long to get used to it.)

Thanks again!

Nikhil Maan

Mar 28, 2020, 5:22:46 PM
to sympy
If you also plan to help out with Fortran, since you'll be improving the previous parsers, I think it'll be better to work on the C and Fortran parser together during the initial period.

I'd suggest something like weeks 1-6 on the C and Fortran parsers and weeks 7-11 on the Java parser, along with the tests and so on, keeping week 12 as a buffer.
That's my rough estimate. You can tune it as per your comfort.

Also, don't worry about learning Fortran. If you know C, you'll be able to learn it easily during the community bonding period. I did too :)
And you can ping us if you need any help regarding that.

As for the C parser backend, I'd suggest we stick to Clang for now. It's definitely better than pycparser. I'm looking into what can be done for the bindings.

Regards,
Nikhil Maan

Gajjar Smit

Mar 29, 2020, 6:17:55 PM
to sympy
I have updated the proposal according to the latest suggestions. Please review the final draft. I am uploading it to the website now and if required, I will update it!

Also, thanks for the guidance about Fortran :)

Peter Kayode

Mar 29, 2020, 7:00:10 PM
to sy...@googlegroups.com
Thanks, sir.
The only problem I have now is writing the proposal.
I don't know how to go about it.


Ondřej Čertík

Mar 30, 2020, 1:04:53 PM
to sympy
Thanks Gajjar. I agree with Nikhil and overall this looks good.

Ondrej


Gajjar Smit

Mar 30, 2020, 2:33:38 PM
to sy...@googlegroups.com
Thank you for the review, Sir!

I will upload a PDF of this final version!

Also, I found a workaround for parsing binary operators in Clang (using tokens). I will make a PR soon.
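
The thread does not spell out the workaround, so the following is only one plausible token-based approach, not necessarily what the PR does. It assumes the token spellings of the whole binary expression and of its left operand are already available as lists of strings (with clang.cindex these could come from `[t.spelling for t in cursor.get_tokens()]` on the expression cursor and on its first child). The function name is hypothetical.

```python
from typing import Sequence


def binary_operator_from_tokens(expr_tokens: Sequence[str],
                                left_tokens: Sequence[str]) -> str:
    """Return the operator token that directly follows the left operand.

    Assumes left_tokens is exactly the leading part of expr_tokens, which
    may not hold when macros are involved.
    """
    n = len(left_tokens)
    if list(expr_tokens[:n]) != list(left_tokens):
        raise ValueError("left operand tokens are not a prefix of the expression")
    if n >= len(expr_tokens):
        raise ValueError("no operator token follows the left operand")
    return expr_tokens[n]


if __name__ == "__main__":
    # a <= b: the left operand is the single token 'a'
    print(binary_operator_from_tokens(["a", "<=", "b"], ["a"]))
    # (a + b) * c: the left operand is the parenthesised sub-expression
    print(binary_operator_from_tokens(
        ["(", "a", "+", "b", ")", "*", "c"],
        ["(", "a", "+", "b", ")"],
    ))
```

Counting past the left operand's tokens, rather than searching for the first operator-like token, keeps nested expressions such as `(a + b) * c` from returning the inner `+`.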

Thanks
