[GSoC 2026] Improve the Lark-based LaTeX parser

Stefania Mozacu

Feb 26, 2026, 2:20:04 PM

to sympy

Hi,

My name is Stefania Mozacu and I’m a third-year Computer Science and Engineering student at the Technical University of Cluj-Napoca.

I’ve been going through the blog posts about the Lark-based LaTeX parser rewrite and the current SymPy issues around it, and I found the project extremely interesting, especially the discussion around ambiguous expressions and the limitations of purely grammar-based parsing.

I’d love to pursue this idea for GSoC and contribute to improving the Lark parser towards becoming a robust drop-in replacement for the existing ANTLR-based implementation.

One direction that stood out to me is the set of unresolved ambiguity problems mentioned in the posts, for example distinguishing function calls from symbols in cases like f(x), or interpreting dx as a differential, as a single variable, or as the product d*x. Since these cases are context-sensitive and difficult to resolve at the CFG level, I was thinking about exploring a two-stage approach.

The idea would be to let the Lark parser preserve ambiguity by producing multiple candidate parse trees where necessary, and then introduce a lightweight semantic resolution layer that selects the most consistent interpretation using SymPy’s symbolic reasoning and heuristics (for example expected patterns in integrals/derivatives, symbol consistency, or optional user hints). This could allow the parser to move beyond purely syntactic translation and handle real-world mathematical input more robustly.
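
To make the first stage more concrete, here is a minimal sketch of what "preserving ambiguity" could look like. It does not use Lark itself: the `Ambig` class, the tuple-based trees and the `expand` helper are hypothetical stand-ins I made up for illustration. As far as I understand, Lark's Earley parser can keep alternatives under `_ambig` nodes when run with ambiguity='explicit', so a real version would walk those nodes instead.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Ambig:
    """Hypothetical node holding alternative readings of the same input span."""
    alternatives: tuple


def expand(node):
    """Return every unambiguous tree encoded by `node`, in a stable order."""
    if isinstance(node, Ambig):
        # Each alternative may itself contain nested ambiguities.
        return [tree for alt in node.alternatives for tree in expand(alt)]
    if isinstance(node, tuple):
        # Trees are (head, child, child, ...); combine all child expansions.
        head, *children = node
        return [(head, *combo) for combo in product(*(expand(c) for c in children))]
    return [node]  # leaf: a symbol name or a number literal


# f(x) read either as a function application or as the product f*x.
f_of_x = Ambig((("apply", "f", "x"), ("mul", "f", "x")))
forest = ("add", f_of_x, "1")  # stands for f(x) + 1
candidates = expand(forest)

for tree in candidates:
    print(tree)
```

Because the alternatives are always expanded in the same order, the list of candidates is the same on every run, which I think matters if the parser is expected to behave deterministically.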

From an implementation perspective, I imagine starting with a baseline that keeps compatibility with the current transformer pipeline, then adding a small abstraction for ambiguous nodes (or parse forests) and building a resolver that scores candidate interpretations and returns a final SymPy expression. The work could be incremental, beginning with specific ambiguity classes (function vs symbol, differential handling) and expanding based on test coverage and existing issues.
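
As a rough sketch of the resolver idea, assuming the candidates have already been extracted from the parse forest: the `Candidate` class, the `score` heuristics and the `context` keys below are all hypothetical names chosen for this example, not existing SymPy or Lark APIs, and the weights are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    reading: str  # label for the interpretation, e.g. "differential"
    tree: tuple   # simplified stand-in for a parse tree


def score(candidate, context):
    """Toy heuristics; the weights are arbitrary and only for illustration."""
    points = 0
    # Inside an integral, "dx" is most plausibly the differential.
    if context.get("inside_integral") and candidate.reading == "differential":
        points += 2
    # If the user declared both d and x as symbols, d*x becomes plausible.
    if candidate.reading == "product" and {"d", "x"} <= context.get("known_symbols", set()):
        points += 1
    return points


def resolve(candidates, context):
    """Rank candidates by score; ties are broken by label so the order is stable."""
    return sorted(candidates, key=lambda c: (-score(c, context), c.reading))


# Three hypothetical readings of the input "dx".
dx_candidates = [
    Candidate("symbol", ("symbol", "dx")),
    Candidate("product", ("mul", "d", "x")),
    Candidate("differential", ("differential", "x")),
]

ranked = resolve(dx_candidates, {"inside_integral": True})
print([c.reading for c in ranked])
```

Returning the full ranked list rather than a single winner would let callers either take the top candidate or inspect every interpretation themselves.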

I’m very open to refining the scope based on what would be most useful for the project, so I’d really appreciate any feedback on whether this direction makes sense or if there are particular gaps you’d recommend focusing on.

Thank you, and I’d be happy to discuss further.

Best,
Stefania

https://www.linkedin.com/in/stefania-mozacu/

Aaron Meurer

Feb 26, 2026, 3:37:04 PM

to sy...@googlegroups.com

I'm not familiar with all the implementation details of the Lark
parser, but I think there are some good ideas here.

Personally, I think the LaTeX parser should focus on being something
that can produce consistent and deterministic results. Giving all
possible parse trees is a good way to do this. The reason I think this
is that LLMs can already do fuzzy parsing and translation pretty well.
It's better for the parser to focus on something a language model
cannot do, like being deterministic.

Aaron Meurer
