Hi,
My name is Stefania Mozacu and I’m a third-year Computer Science and Engineering student at the Technical University of Cluj-Napoca.
I’ve been going through the blog posts about the Lark-based LaTeX parser rewrite and the current SymPy issues around it, and I found the project extremely interesting, especially the discussion around ambiguous expressions and the limitations of purely grammar-based parsing.
I’d love to pursue this idea for GSoC and help make the Lark parser a robust drop-in replacement for the existing ANTLR-based implementation.
One direction that stood out to me is the set of unresolved ambiguity problems mentioned in the posts, for example deciding whether f(x) is a function call or the symbol f multiplied by x, or whether dx is a differential, a single variable, or the product of d and x. Since these cases are context-sensitive and difficult to resolve at the CFG level, I was thinking about exploring a two-stage approach.
The idea would be to let the Lark parser preserve ambiguity by producing multiple candidate parse trees where necessary, and then introduce a lightweight semantic resolution layer that selects the most consistent interpretation using SymPy’s symbolic reasoning and heuristics (for example expected patterns in integrals/derivatives, symbol consistency, or optional user hints). This could allow the parser to move beyond purely syntactic translation and handle real-world mathematical input more robustly.
From an implementation perspective, I imagine starting with a baseline that keeps compatibility with the current transformer pipeline, then adding a small abstraction for ambiguous nodes (or parse forests) and building a resolver that scores candidate interpretations and returns a final SymPy expression. The work could be incremental, beginning with specific ambiguity classes (function vs symbol, differential handling) and expanding based on test coverage and existing issues.
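As a rough illustration of the resolver idea, the sketch below scores candidate readings of something like f(x) against a small context and returns the best one. Everything in it (Candidate, Context, score, resolve, and the weights) is a hypothetical name or value I made up for this example. A real version would work on parse trees and return SymPy expressions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Candidate:
    label: str  # hypothetical reading: "call" or "mul"
    head: str   # leading name, e.g. "f" in f(x)


@dataclass
class Context:
    # Names already seen in the input or supplied as user hints.
    known_functions: set = field(default_factory=set)
    known_symbols: set = field(default_factory=set)


def score(candidate: Candidate, ctx: Context) -> float:
    """Reward readings that agree with the context, penalize conflicts."""
    if candidate.label == "call":
        agrees, conflicts = ctx.known_functions, ctx.known_symbols
    else:
        agrees, conflicts = ctx.known_symbols, ctx.known_functions
    total = 0.0
    if candidate.head in agrees:
        total += 2.0  # arbitrary illustrative weight
    if candidate.head in conflicts:
        total -= 1.0  # arbitrary illustrative weight
    return total


def resolve(candidates: list, ctx: Context) -> Candidate:
    """Pick the highest-scoring candidate; ties keep the earliest one."""
    return max(candidates, key=lambda c: score(c, ctx))


readings = [Candidate("call", "f"), Candidate("mul", "f")]
as_function = resolve(readings, Context(known_functions={"f"}))
as_symbol = resolve(readings, Context(known_symbols={"f"}))
default = resolve(readings, Context())
```

Because ties fall back to the first candidate, the order in which readings are listed acts as the default policy, which could be one way to keep the current behaviour when no context is available.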
I’m very open to refining the scope based on what would be most useful for the project, so I’d really appreciate any feedback on whether this direction makes sense or if there are particular gaps you’d recommend focusing on.
Thank you, and I’d be happy to discuss further.
Best,
Stefania