I didn't want to hijack the thread, so I kept the details on my side to a minimum. Most of it is a moving work in progress, so I'd rather not jump into specifics until I have a couple of repositories with demonstrations ready to go public. But I agree the surrounding questions are quite interesting.
On incrementally improving grammars: that is beyond the scope of Marpa and beyond the immediate scope of my current project, but it feels like an inevitable piece of language understanding. Papers bring along a lot of "local" notations, so a grammatical approach to modeling them would need to both discover and recognize those notations. As this hasn't been my main focus, I am only familiar with one recent work on the topic, by Kevin Kofler, which departs from the tables of Earley's algorithm and is instead based on runtime graphs that are efficient to update on the fly:
Given that it's a recent PhD thesis, take it with a grain of salt; it is certainly a problem with lots of open questions to answer.
As to the goal of parsing arXiv formulas, the most direct application is "semantic search" over them. I'll try to give an example that is easy to think about.
I recently had an exchange where someone stated "obviously a blackboard N (ℕ) can only mean the set of natural numbers". Since I have the arXiv data in machine-readable form, I could quickly grep for "let ℕ" and recover the following snippets from actual papers:
1. Linear algebra - a matrix
- "let ℕ be the matrix given by proposition 2"
2. Category theory
- "let ℕ be a functor of R modules"
- "let ℕ be the category of finite sets with inclusions as morphisms"
3. Information theory - coding instance
- "let ℕ be a secure network coding instance and 𝕀 be the corresponding secure index coding instance obtained using the network to index coding mapping"
4. Topology - a complex
- "let ℕ be the complex containing the point x the argument in wet(x,𝕄) and let ℕ^' be the other complex"
5. Probability theory - a martingale
- "let ℕ be a continuous local martingale with values in ..."
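For concreteness, the kind of scan that surfaces these bindings can be sketched in a few lines. This is a toy of my own (the pattern and helper are illustrative assumptions, not the actual tooling, which works over much richer structure):

```python
import re

# Toy sketch: find the first word a snippet binds ℕ to.
# The pattern is an illustrative assumption, not the real extraction pipeline.
PATTERN = re.compile(r"let\s+ℕ\s+be\s+(?:an|a|the)\s+(\w+)")

def head_of_definition(snippet):
    """Return the first word following "let ℕ be a/an/the", or None."""
    m = PATTERN.search(snippet)
    return m.group(1) if m else None
```

Running it over the snippets above yields "matrix", "functor", "category", and so on; of course a real pass has to cope with math markup rather than flat strings.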
And there are probably others. This is to suggest that, at arXiv scale, likely ALL symbols in expressions have a multitude of meanings. Hence the need for an approach capable of dealing with the explosion of lexical ambiguity in math syntax. If one could craft rules for all scientific subdomains (plus one "generic" domain, so that well-formed expressions in unknown or new domains still parse), one would get a forest of all coherent parses while pruning away the senseless readings. The parse forest can then be packed and serialized (e.g. as Content MathML) for downstream applications to reuse. The most direct of these are semantic search engines such as MathWebSearch, which can let authors write queries that target a symbol's meaning rather than its syntactic sign:
As an example, quite neatly, you could ask for expressions involving a variable known to be a "martingale" (by setting a query variable's type) and obtain surprising results in notations you weren't familiar with (such as the blackboard ℕ here).
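To caricature the forest-then-prune idea, here is a toy of my own making (the domain lexicons and the coherence check are illustrative assumptions, far simpler than real grammar rules): each subdomain assigns candidate meanings to symbols, and only readings under which every symbol is interpretable survive.

```python
# Toy sketch: each "lexicon" maps symbols to a meaning for one subdomain.
# A reading survives only if every symbol in the expression is covered.
# The domains and typings here are illustrative assumptions.
LEXICONS = {
    "number_theory":  {"ℕ": "set of naturals"},
    "probability":    {"ℕ": "martingale"},
    "linear_algebra": {"ℕ": "matrix", "v": "vector"},
}

def coherent_readings(symbols):
    """All (domain, typing) pairs under which every symbol has a meaning."""
    return [
        (domain, {s: lexicon[s] for s in symbols})
        for domain, lexicon in LEXICONS.items()
        if all(s in lexicon for s in symbols)
    ]
```

On its own, "ℕ" stays three-ways ambiguous here; paired with a vector "v", only the linear-algebra reading survives. That is the pruning behaviour in miniature, with context rather than hand-picked domains doing the work in the real system.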
There are other applications beyond search, e.g. making expressions easy to transfer into computer algebra systems, heading eventually towards the long-term goal of a computer assistant that walks a paper's equations and flags obvious contradictions and errors, much the way a spellchecker validates words.
I should try to set up the Discord you referred to; it could be interesting to co-brainstorm at some point. But I have some work to do on my end before I get back to juggling the parse forests.