Oct 26, 2019, 7:13:47 AM10/26/19

to marpa parser

Here is one of my recent coding endeavors that I am particularly satisfied with: a rewrite of a Shared Packed Parse Forest. I found a way to greatly simplify it, resulting in code that is 680 lines long. The key insight for Earley parsing is that partial parses can be ordered through online sorting. If they enter the parse forest bottom-up, left-to-right*, they are ready for traversal. The only tree traversal left to do is discarding failed partial parses, which won't be part of the full parse. I do this traversal with a mark-and-sweep (or rather mark-and-ignore) algorithm similar to those used in garbage collectors. The online sort is done with a simple binary (max-)heap.
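The heap-based online sort can be sketched in a few lines. This is an illustrative toy, not gearley's actual code: the `PartialParse` fields and the (end, start) ordering key are assumptions, and the standard library's `BinaryHeap` wrapped in `Reverse` stands in for the max-heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A partial parse keyed by its span. Deriving Ord gives lexicographic
// ordering by (end, start, rule), which approximates the bottom-up,
// left-to-right order in which completed items become ready.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct PartialParse {
    end: u32,
    start: u32,
    rule: u32,
}

/// Drain partial parses in sorted order using a binary heap (online sort).
/// `Reverse` turns the std max-heap into a min-heap, so items pop in
/// ascending (end, start) order, ready for a single forward traversal.
fn drain_sorted(items: Vec<PartialParse>) -> Vec<PartialParse> {
    let mut heap: BinaryHeap<Reverse<PartialParse>> =
        items.into_iter().map(Reverse).collect();
    let mut sorted = Vec::with_capacity(heap.len());
    while let Some(Reverse(item)) = heap.pop() {
        sorted.push(item);
    }
    sorted
}
```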

Jeffrey Kegler remarked that he didn't achieve the mathematical simplicity in context-free parsing that he wished for. I did two complete rewrites of my parse forest. The first version was loosely based on Jeffrey Kegler's code, with nooks. The second was my own revision of a parse forest. In the latest, third version, I achieved the mathematical simplicity that Jeffrey Kegler pursued from the beginning.

You may take a look at the code at https://github.com/pczarn/gearley/tree/master/src/forest. The most important files are "bocage", "node" and "traverse". "Bocage" is Jeffrey Kegler's term for a parse forest. The file contains the creation of both the bocage and its partial parses.

Next is "node", which has a packed representation of nodes in 12 bytes. If I could afford to lose this packed representation, the code could be even shorter, but nodes would be 16 bytes in size. Now, I could implement nodes with variable sizes -- some would be 4 or 8 bytes (1 or 2 integers) in size, and the largest 12 bytes in size.
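One way a 12-byte packed node could look is three 32-bit words with the node kind squeezed into the high bits of the first word. This is a hypothetical layout to illustrate the size trade-off described above, not gearley's actual bit packing:

```rust
// Illustrative 12-byte packed node: three u32 fields, with a 2-bit tag
// (node kind) packed into the high bits of the first word, leaving
// 30 bits for a symbol or rule id. #[repr(C)] fixes the layout so the
// struct is exactly 3 * 4 = 12 bytes, versus 16 with a separate tag field.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct PackedNode {
    tag_and_id: u32, // high 2 bits: tag; low 30 bits: id
    left: u32,       // e.g. left child index
    right: u32,      // e.g. right child index
}

impl PackedNode {
    fn new(tag: u32, id: u32, left: u32, right: u32) -> PackedNode {
        debug_assert!(tag < 4 && id < (1 << 30));
        PackedNode { tag_and_id: (tag << 30) | id, left, right }
    }

    fn tag(&self) -> u32 {
        self.tag_and_id >> 30
    }

    fn id(&self) -> u32 {
        self.tag_and_id & ((1 << 30) - 1)
    }
}
```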

Last but not least, "traverse" is used for evaluating the full parse. As I already explained, it is a simple loop through all nodes in the order they entered the forest.
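The reason a plain forward loop suffices can be shown with a toy node type (hypothetical and far simpler than gearley's): because children always enter the forest before their parents, every child's value is already computed by the time its parent is visited.

```rust
/// Toy parse-forest nodes stored in bottom-up order: every child index
/// is strictly smaller than its parent's index. (Hypothetical node type
/// for illustration only.)
enum ToyNode {
    Leaf(i64),
    Sum(usize, usize), // indices of two earlier nodes
}

/// Evaluate every node in one forward pass. No recursion or explicit
/// tree traversal is needed, because children precede their parents.
fn evaluate(nodes: &[ToyNode]) -> Vec<i64> {
    let mut values = Vec::with_capacity(nodes.len());
    for node in nodes {
        let v = match *node {
            ToyNode::Leaf(n) => n,
            // values[l] and values[r] already exist: l, r < current index.
            ToyNode::Sum(l, r) => values[l] + values[r],
        };
        values.push(v);
    }
    values
}
```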

The code has an abstraction that allows you to plug in a so-called null forest, skipping forest construction and use entirely and leaving only the recognizer at work. Alternatively, you can plug in your own implementation of a forest, but I don't foresee any use for this except perhaps in a parser generator for extremely optimized parsers, where each generated parser has its own optimal forest implementation.
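The pluggable-forest idea might look roughly like the trait below; the trait and method names are illustrative assumptions, not gearley's real interface. With `()` as the node handle, every forest call in a recognizer generic over this trait compiles away to nothing.

```rust
// Sketch of a pluggable forest: the recognizer would be generic over
// this trait. (Hypothetical names; see the gearley repository for the
// actual interface.)
trait Forest {
    type NodeRef: Copy;
    fn leaf(&mut self, symbol: u32, start: u32, end: u32) -> Self::NodeRef;
    fn product(&mut self, left: Self::NodeRef, right: Self::NodeRef) -> Self::NodeRef;
}

/// A forest that records nothing, leaving only the recognizer at work.
struct NullForest;

impl Forest for NullForest {
    // A zero-sized node handle: constructing and passing it is free.
    type NodeRef = ();

    fn leaf(&mut self, _symbol: u32, _start: u32, _end: u32) -> Self::NodeRef {}

    fn product(&mut self, _left: (), _right: ()) -> Self::NodeRef {}
}
```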

* I guess it's called rightmost derivation, or x-most derivation, but I'm not sure.

Oct 26, 2019, 7:26:47 AM10/26/19

to marpa parser

Next up, I am going to implement a runtime mode for my parser generator, removing the need to bootstrap it. All processing will be done at runtime. Stage 1 will generate Panini using a runtime parser, and stage 2 will generate Panini with the stage 1 result.

Oct 26, 2019, 10:46:40 AM10/26/19

to marpa parser

Hi Piotr,

This is very cool work! And I am doubly excited it is happening in Rust.

I am a very slow-moving end-user of Shared Packed Parse Forests, as I am gradually (after a multi-year pause) gearing up to parse all 500 million formulas from arXiv with an incrementally improving ambiguous Marpa grammar. I also happen to be using Rust nowadays, and I hope to offer at least one take on a pragmatic macro language for writing grammars in Rust (I'm currently cleaning up the initial prototype). I failed to get anywhere near rewriting the SLIF code (https://github.com/jrobsonchase/marpa/issues/2), but maybe a more naive first set of steps will get me closer.

While I don't know much about bocage internals, I just wanted to send an encouraging "this looks cool!" reply -- having SPPFs that are extremely efficient while elegantly designed sounds very attractive.

I hope the best parts of the work end up landing in the marpa codebase itself, and maybe the Rust talk sparks additional interest. In the latter direction, I recently had a brief exchange with Per from the c2rust team (https://github.com/immunant/c2rust), who seem to be very close to cross-compiling some hallmark C projects (e.g. libxml2) into fully functional Rust. Using c2rust may be an easy "cheat" to transport the thin layer of marpa natively into Rust, potentially unifying a number of efforts. Just listing my dreams out loud for the moment, and expressing some interest here :-)

Greetings,

Deyan

Oct 26, 2019, 11:39:25 AM10/26/19

to Marpa Parser Mailing LIst

I'm currently immersed in a rewrite of the Marpa theory paper, and until I'm done with that I won't have the cycles to give this work the sort of serious attention it deserves. Some random notes:

"Bocage" is my term, and I developed the idea independently, but Elizabeth Scott preceded me and did a very nice write-up. So close are the two that her paper is the write-up I might otherwise have needed to do, except that hers is probably better than what I would have produced.

In Marpa::R3, the bocage is being eliminated in favor of evaluation based on the ASFs of Marpa::R2.

I look forward to a closer look, but alas it cannot happen immediately.

Oct 26, 2019, 11:39:41 AM10/26/19

to marpa parser

Hi Deyan,

Could you give an example of a formula and its effect? Is it grammar synthesis from sentences? Could you link papers that implement incrementally improving grammars? This is interesting. I'm sure I've read a paper about grammar synthesis from sentences. I would like to implement this in Rust. What is the use case for this, with arXiv as input, though?

Thanks for the encouragement, I highly appreciate it. If you would like to know anything about bocage internals, ping me on Discord or here, and I can explain it all. If so, we would need a starting point -- to what extent are you familiar with SPPFs and their implementation?

Cheers

--Piotr

Oct 26, 2019, 1:07:49 PM10/26/19

to marpa parser

Hi again,

I didn't want to hijack the thread, so I kept the details on my side as scarce as possible. Most of it is a moving work in progress, so I won't jump into discussing the specifics until after I make a couple of repositories with demonstrations ready to be public. But I agree the surrounding questions are quite interesting.

On the incrementally improving grammar, that is beyond the scope of Marpa and beyond the immediate scope of my current project, but feels like it is an inevitable piece in language understanding. Papers bring along a lot of "local" notations, so a grammatical approach to model them would need to both uncover and recognize said notations. As this hasn't been my main focus, I am only familiar with one recent work on the topic by Kevin Kofler, which departs from the tables of Earley's algorithm and is instead based on runtime graphs that are efficient to update on the fly:

Given that it's a recent PhD thesis, take that with a grain of salt; it is certainly a problem with lots of open questions to answer.

As to the goal of parsing the arXiv formulas, the most direct application is "semantic search" over the formulas. I'll try to give an example that is easy to think about.

I recently had an exchange where someone stated "obviously a blackboard N (ℕ) can only mean the set of natural numbers". Since I have the arXiv data in machine-readable form, I could quickly grep for "let ℕ" and recover the following snippets from actual papers:

1. Linear algebra - a matrix

- "let ℕ be the matrix given by proposition 2"

2. Category theory

- "let ℕ be a functor of R modules"

- "let ℕ be the category of finite sets with inclusions as morphisms"

3. Information theory - coding instance

- "let ℕ be a secure network coding instance and 𝕀 be the corresponding secure index coding instance obtained using the network to index coding mapping"

4. Topology - a complex

- "let ℕ be the complex containing the point x the argument in wet(x,𝕄) and let ℕ^' be the other complex"

5. Probability theory - a martingale

- "let ℕ be a continuous local martingale with values in ..."

And there are probably others. This is to motivate that likely ALL symbols in expressions have a multitude of meanings when taken to arXiv's scale. Hence the need for an approach capable of dealing with the explosion of lexical ambiguity in math syntax. If one could craft rules for all scientific subdomains (as well as one "generic" domain, so that well-formed expressions in unknown or new domains still parse), one could get a forest of all coherent parses while pruning away all senseless readings. The parse forest can then be packed and serialized (e.g. in Content MathML) for downstream applications to reuse. The most direct consumers are semantic search engines such as MathWebSearch, which can let authors write queries that target a symbol's meaning rather than its syntactic sign.

As an example, quite neatly, you could ask for an expression involving a variable known to be a "martingale" (by setting a query variable's type), and obtain surprising results in notations one wasn't familiar with (such as the blackboard N here).

There are other applications beyond search, e.g. making expressions easy to transfer into computer algebra systems, eventually heading towards the long-term goal of a computer assistant that can walk a paper's equations and find obvious contradictions/errors, akin to the way a spellchecker validates words.

I should try to set up the Discord you referred to; it could be interesting to co-brainstorm at some point. But I have some work to do on my end before I get back to juggling parse forests.

Greetings,

Deyan

Oct 30, 2019, 4:12:18 PM10/30/19

to marpa parser

Thanks to some optimizations, I achieved a parser where forest construction adds only 15% overhead over mere recognition. I'm seeing ~200 ns per token on an i7 (7th gen). I have yet to compare the speed with that of YAEP.

Nov 19, 2019, 5:32:59 AM11/19/19

to marpa parser

I am surprised by the effectiveness of this new optimization. I managed to keep the recognizer's memory use down to 56 KB while parsing 10,000 lines of C. This way, the parse runs 25% faster. Next up, I should optimize the forest's memory use, which is 12 MB during that same parse. I may be able to bring that down by 50% or more. One useful thing to have would be pruning of useless SPPF nodes.
