tests for parsing Latex input to Sympy


Ben

May 24, 2020, 3:01:09 AM
to sympy
Hello,

I'm using SymPy to parse mathematical expressions written in LaTeX. I have observed that parsing LaTeX does not always work, so I've been collaborating with a friend to modify the ANTLR grammar file to address some of the issues we have encountered. The repo with the changed files (as well as a Dockerfile to configure the environment and build SymPy with the modified grammar) is https://github.com/allofphysicsgraph/sympy-grammar-modifications

I'm interested in contributing the modified grammar file to SymPy. I have not contributed to SymPy before, but I've read the SymPy workflow documentation.
My background: I've been using Python for about 15 years and am comfortable with git and branching.
Prior to making the pull request, I have a question.

I don't see where the current grammar file for parsing LaTeX is tested. Looking at the script https://github.com/sympy/sympy/blob/master/bin/test didn't give me any insight. Also, I don't see any tests defined in the directory https://github.com/sympy/sympy/tree/master/sympy/parsing/latex

I want to eventually make a pull request regarding the LaTeX parsing grammar, but I don't know where to create tests that would validate the changes. I'd like to be able to demonstrate that the changes do not break SymPy.

Kindly,

Ben


Ben

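May 24, 2020, 11:33:20 AM
to sympy
To answer my own question, I think I found the tests:
https://github.com/sympy/sympy/blob/master/sympy/parsing/tests/test_latex.py
https://github.com/sympy/sympy/blob/master/sympy/parsing/tests/test_sympy_parser.py
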
Aaron Meurer

May 24, 2020, 11:30:38 PM
to sympy
It would be great to have improvements to the LaTeX parser. Let us
know if you have any issues opening a pull request.

The test_latex.py file is correct. test_sympy_parser.py has tests for
the Python parser, which isn't related to the LaTeX parser as far as I
know.


Aaron Meurer

Ben

May 25, 2020, 5:36:58 PM
to sympy
In the process of working on handling spaces in LaTeX, I had two realizations. First, a space in LaTeX math could mean "multiply two variables", or it could just be a way of managing the layout of the expression. (I posted examples in https://github.com/sympy/sympy/issues/19075.)

My second realization was that it might be easier to remove the aspects of the LaTeX string that are related to presentation. Specifically in this case, replace "\ " with " " and "\," with " " in the LaTeX string before passing it to SymPy.
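
A minimal sketch of that preprocessing step, in case it helps the discussion. The helper name `strip_spacing_commands` is hypothetical and not part of SymPy. The negative lookbehind, which leaves a `\\` line break followed by a space or comma alone, is an added assumption beyond the two replacements described above.

```python
import re

# Hypothetical helper, not part of SymPy.
# Matches "\ " (control space) or "\," (thin space), but only when the
# backslash is not itself preceded by another backslash, so that a line
# break such as "\\ " is left untouched (an assumption, not from the thread).
_SPACING = re.compile(r"(?<!\\)\\[ ,]")


def strip_spacing_commands(latex: str) -> str:
    # Replace each matched spacing command with one plain space.
    return _SPACING.sub(" ", latex)


print(strip_spacing_commands(r"a\ b"))        # a b
print(strip_spacing_commands(r"\int x\,dx"))  # \int x dx
```

The cleaned string could then be handed to `sympy.parsing.latex.parse_latex`, which needs the ANTLR Python runtime installed. Other spacing commands (for example `\;` or `\quad`) are not covered here.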

Kindly,

Ben

On Sun, May 24, 2020 at 9:33 AM Ben <ben.is...@gmail.com> wrote:
>
> To answer my own question, I think I found the tests:
> https://github.com/sympy/sympy/blob/master/sympy/parsing/tests/test_latex.py
> https://github.com/sympy/sympy/blob/master/sympy/parsing/tests/test_sympy_parser.py

David Bailey

May 25, 2020, 6:26:01 PM
to sy...@googlegroups.com
Hi Ben,

I don't want to discourage you in any way, and I may be naive, but I'd have thought LaTeX would always be ambiguous one way or another, particularly if it is handwritten. I'd have thought the best solution in the long term would be for people to write their equations in SymPy and then generate LaTeX with the latex() function.

David

Ben

May 25, 2020, 6:42:47 PM
to sympy

You're totally correct -- LaTeX is ambiguous. I don't find your observation discouraging since it is perfectly reasonable.

The issue I'm interested in tackling is the conversion of math presented in physics papers (e.g., .tex files on arxiv.org) to a semantically meaningful and unambiguous representation (e.g., SymPy).

This issue would be moot if physics papers were written in SymPy. I don't have insight into how to construct incentives that would lead to the use of SymPy in physics papers, so I'm working on the LaTeX-to-SymPy approach.

Kindly,

Ben

David Bailey

May 26, 2020, 7:23:42 AM
to sy...@googlegroups.com
Right - well, in that case, maybe a system of hints that the user could pass to your parser would be really useful. For example, if a user could tell your parser that superscripts were usually tensor indices rather than exponents (or alternatively that certain symbols used as superscripts would never mean exponents), you could come out with a better translation. Another useful hint might be a list of the multi-letter symbols in use (sin, cos, exp, ln, etc.) so that you could resolve the ambiguity of what ab means. I mean, sometimes sin(x) might mean s*i*n(x), and that could be handled by the user specifying that only certain multi-letter symbols were in use.
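
To make the multi-letter hint concrete, here is a rough sketch under stated assumptions. The function `split_letters` and its `known_names` parameter are hypothetical, not an existing SymPy API. It splits a run of letters into factors, keeping only the hinted names whole and treating every other letter as a separate single-letter symbol.

```python
from typing import Iterable, List


def split_letters(run: str, known_names: Iterable[str]) -> List[str]:
    """Split a run of letters into factors, keeping hinted names whole."""
    # Try longer names first so "sinh" wins over "sin" when both are hinted.
    names = sorted(set(known_names), key=len, reverse=True)
    factors = []
    i = 0
    while i < len(run):
        for name in names:
            # Skip empty hints so the index always advances.
            if name and run.startswith(name, i):
                factors.append(name)
                i += len(name)
                break
        else:
            # No hinted name starts here: treat the letter as its own symbol.
            factors.append(run[i])
            i += 1
    return factors


print(split_letters("sinx", {"sin", "cos", "exp", "ln"}))  # ['sin', 'x']
print(split_letters("sinx", set()))                        # ['s', 'i', 'n', 'x']
```

With the hint list, "sinx" reads as sin applied to or multiplied by x; without it, the same letters read as the product s*i*n*x. A real parser would apply such hints inside the grammar or lexer rather than as a separate string pass.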

David


Ben

May 26, 2020, 7:33:14 AM
to sympy
Yeah, in talking this over with a collaborator, we think there are various sources of context to help with parsing:
  • within the math LaTeX string to parse, what can be deduced about the expected context?
  • given other math expressions in the same paper, what would be consistent?
  • given the text in a paper surrounding the math expressions, what would be expected based on keywords?
  • given other papers in the same domain or based on citations, what would be likely?
  • what is statistically likely given the corpus of all articles?
This is, in some sense, the same process a human goes through to decode the intended meaning of any given math expression in a scientific paper. We are looking to encode that process as a Python program. (That's beyond the scope of SymPy but is context for the issue.)
 

Francesco Bonazzi

May 27, 2020, 4:46:26 AM
to sympy
Parsing a LaTeX expression should ideally return candidate SymPy expressions with a matching probability. In case of unambiguous matching, only one expression should have a high matching probability. In case of ambiguous matching, two or more SymPy expressions should have high probability.

Topic also matters. If you have a physics paper, you'd probably want it to match some particular subset of expressions.
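
As an illustration only, the idea of ranked candidates weighted by topic might be sketched as follows. Everything here is hypothetical: the function names, the scores, and the use of plain strings to stand in for candidate SymPy expressions. The sketch multiplies a parser score by a topic prior and normalizes the result, which is one simple way to get probabilities and not a description of any existing parser.

```python
from typing import Dict, List, Tuple


def rank_candidates(
    likelihoods: Dict[str, float],
    topic_prior: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Combine parser scores with topic priors and normalize to probabilities."""
    # Unnormalized weight = parser score * topic prior (prior defaults to 1.0).
    weights = {
        expr: score * topic_prior.get(expr, 1.0)
        for expr, score in likelihoods.items()
    }
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive weight")
    ranked = [(expr, w / total) for expr, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def is_ambiguous(ranked: List[Tuple[str, float]], threshold: float = 0.25) -> bool:
    """Report ambiguity when two or more candidates reach the threshold."""
    return sum(1 for _, p in ranked if p >= threshold) >= 2


# Made-up scores for the LaTeX string "x^i": exponent vs. tensor index.
scores = {"x**i": 0.5, "x[i]": 0.5}
tensor_topic = {"x**i": 0.2, "x[i]": 0.8}  # hypothetical prior for a tensor-heavy paper

print(is_ambiguous(rank_candidates(scores, {})))            # True
print(is_ambiguous(rank_candidates(scores, tensor_topic)))  # False
```

Without a topic prior the two readings tie, so the result is flagged as ambiguous. With the prior, the tensor-index reading dominates and a single candidate remains above the threshold.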