Questions on isolating utterances before completely parsing

14 views
Skip to first unread message

symuyn

unread,
Oct 14, 2010, 7:13:49 PM10/14/10
to lojban
I've got a hypothetical problem. It's pretty long, but please bear
with me.

Let's say that, hypothetically, someone is creating a text editor for
Lojban, one which shows the syntactical structure of whatever you've
typed *while you type*. The text would be displayed somewhat like
this:

‹mi ‹‹klama klama› ‹klama bo klama›››

Let's also imagine, hypothetically, that this person has made the
editor pre-parse all whitespace/dot-separated chunks of text into the
valsi that the chunks correspond to, identifying their selma'o and all
that (e.g. "melo" → [<"me" in ME> <"lo" in LE>]). This is before
checking the grammar of the text.

So this hypothetical text editor uses two parsers right now: a chunks-
of-text-to-valsi parser and a sequence-of-valsi-to-textual-structures
parser.

Let's also say that, hypothetically, in testing this text editor, that
this person encountered a problem.

The hypothetical text editor becomes slower and slower when the text
grows in size. This is because, unfortunately, the entire text has to
be parsed whenever a new word is added or existing text is deleted.

What to do? The person hypothetically comes up with an idea! There
could be a *third* parser between the already existing two parsers,
one that converts sequences of valsi into unparsed utterances! The
third parser would ignore everything except I, NIhO, LU, LIhU, TO,
TOI, TUhE, and TUhU, using those selma'o to create a tree of unparsed
utterances.

For instance, the third parser would convert the sequence of valsi [i
cusku lu klama i klama li'u to mi cusku toi i cusku] into [[i cusku lu
[[klama] [i klama]] li'u to [mi cusku] toi] [i cusku]].

Therefore, with this new parser, the hypothetical editor can keep
track of what the boundaries of the utterance *currently being edited*
is, and re-parse *only the current utterance* when it's edited.

But then, the person finds a problem with that solution! A fatal flaw:
*LIhU, TOI, and TUhE are elidable*.

Because of that, it seems that it's impossible to isolate an utterance
from the text it is in without parsing the whole text for complete
grammar.

That's the end of the hypothetical situation. My questions are as
following:

* Is it true that the fact that LIhU, TOI, and TUhE are elidable makes
isolating an utterance impossible without completely parsing the text
the utterance is in? (Just making sure.)
* Should the person make the third parser anyway while making LIhU,
TOI, and TUhE *required and non-elidable*?
* Is there another practical solution for the editor?

Remember, the problem is that the hypothetical text editor is getting
slow because otherwise it needs to parse the entire text for every
edit.

.alyn.post.

unread,
Oct 14, 2010, 7:42:21 PM10/14/10
to loj...@googlegroups.com

My first thought, in reading this, is that what you really want is
to have the state of the program (parse tree and token stream) saved
at the place right before the place the person makes the edit.

Then, after the edit is made, you restore the program to the state
it was right before it parsed the text just edited, and continue
parsing, picking up the newly edited text. This is called a
continuation.[1]

Of course, you don't know where the user is going to be editing, so
you have to save multiple continuations during the parse, and restart
whichever one is closest and earlier than the edit, and throw away
the continuations that are now invalid because they occured later in
the text.

This doesn't have to be an excessive number of continuations. You only
have to save as many continuations as you need to make your program fast
enough. You should have some idea about how much text is "too much" for
performance--how much text makes your parser too slow. So you can
create continuations every N bytes knowing you'll only have to parse M
bytes (where M<=N) after any edit.

This saves you the work of having three different parsers in
exchange for whatever overhead the continuation has. That is going to
depend on other factors in your design, but you'd essentially be
trading memory for speed.

Edits early in the document are of course going to need to reparse
everything after the edit, but:

* you'll parse the most recently changed text first, and you can
provide feedback on this before the parse fully finishes.
* Once you get to the next N boundary in your text (accounting for
additions and deletions made, of course), you can compare the results
from the current parse to the parse tree you got last time, and decide
to stop once they start being identical.

I can't really imagine implementing this in any other way, but I'm
curious to read everyone else's ideas.

1: http://en.wikipedia.org/wiki/Continuation

-Alan
--
.i ko djuno fi le do sevzi

Jonathan Jones

unread,
Oct 14, 2010, 7:53:39 PM10/14/10
to loj...@googlegroups.com

I'm not entirely sure what enables those to be elided, but I believe that such cases are rare, like only-at-the-end-of-text rare. Also, there are various people, me, .xorxes., possibly others I don't know, who feel that they should /never/ be elidable anyway.

Based on that, and the fact that it's expected the user is going to be typing more, it's reasonable to assume for the sake of as-you-type parsing, they aren't elided if they aren't in the text, as it's assumed that the end of current input is not the end of text.

In something like {lu ko'a broda to brodi ko'e li'u}, the {li'u} marks the end of the quoted text, so you'd have to allow for that....
 
* Should the person make the third parser anyway while making LIhU,
TOI, and TUhE *required and non-elidable*?

I say yes, but since that's not official, I should say no. Then again, if the third parser /assumes/ non-elidability, I doubt it will cause problems.

Alternatively, you can cause the third parser to assume current-end-of-input is always equal to terminate-everything-unterminated, and that should work out fine.
 
* Is there another practical solution for the editor?

.alyn.'s idea sounds pretty good to me.
 
Remember, the problem is that the hypothetical text editor is getting
slow because otherwise it needs to parse the entire text for every
edit.

Something tells me this "hypothetical" parser isn't very hypothetical. :D

--
mu'o mi'e .aionys.

.i.a'o.e'e ko cmima le bende pe lo pilno be denpa bu .i doi.luk. mi patfu do zo'o
(Come to the Dot Side! Luke, I am your father. :D )

Adam Lopresto

unread,
Oct 15, 2010, 12:40:06 PM10/15/10
to loj...@googlegroups.com
>> * Is it true that the fact that LIhU, TOI, and TUhE are elidable makes
>> isolating an utterance impossible without completely parsing the text
>> the utterance is in? (Just making sure.)
>
> I'm not entirely sure what enables those to be elided, but I believe that
> such cases are rare, like only-at-the-end-of-text rare. Also, there are
> various people, me, .xorxes., possibly others I don't know, who feel that
> they should /never/ be elidable anyway.

Those are elidable for exactly the reason that every other terminator
in the language is elidable, and in exactly the same way. The only
usual thing is that those can include {.i} inside them, while most
others cannot. (LEhU and ZOI are the other considerations; not also ZO
and the rest of the Magic Words).

.i mi lu mi prami do cusku

Is completely grammatical text, and parses exactly as though a {li'u}
had been included between {do} and {cusku}. You may not like the
style, but I assert that that is only because you have not
internalized the grammar.

Nonetheless, it's probably legitimate to assume that those cases are
rare. Particularly, it seems completely fair (hypothetically) to make
a parser that exhibits sub-optimal performance in those unusual cases
(reparsing all of the above bridi, instead of just the {mi prami do}
part, for instance).

The continuations approach feels more right in general, though.

>
> Based on that, and the fact that it's expected the user is going to be
> typing more, it's reasonable to assume for the sake of as-you-type parsing,
> they aren't elided if they aren't in the text, as it's assumed that the end
> of current input is not the end of text.
>
> In something like {lu ko'a broda to brodi ko'e li'u}, the {li'u} marks the
> end of the quoted text, so you'd have to allow for that....
>
>>
>> * Should the person make the third parser anyway while making LIhU,
>> TOI, and TUhE *required and non-elidable*?
>
> I say yes, but since that's not official, I should say no. Then again, if
> the third parser /assumes/ non-elidability, I doubt it will cause problems.
>
> Alternatively, you can cause the third parser to assume current-end-of-input
> is always equal to terminate-everything-unterminated, and that should work
> out fine.
>
>>
>> * Is there another practical solution for the editor?
>
> .alyn.'s idea sounds pretty good to me.
>
>>
>> Remember, the problem is that the hypothetical text editor is getting
>> slow because otherwise it needs to parse the entire text for every
>> edit.
>
> Something tells me this "hypothetical" parser isn't very hypothetical. :D
>
> --
> mu'o mi'e .aionys.
>
> .i.a'o.e'e ko cmima le bende pe lo pilno be denpa bu .i doi.luk. mi patfu do
> zo'o
> (Come to the Dot Side! Luke, I am your father. :D )
>

> --
> You received this message because you are subscribed to the Google Groups
> "lojban" group.
> To post to this group, send email to loj...@googlegroups.com.
> To unsubscribe from this group, send email to
> lojban+un...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/lojban?hl=en.
>

Luke Bergen

unread,
Oct 15, 2010, 2:13:49 PM10/15/10
to loj...@googlegroups.com
.i mi lu mi prami do cusku

Is completely grammatical text, and parses exactly as though a {li'u}
had been included between {do} and {cusku}.

That's bizarre.  I didn't realize that {lu} was that permissive.  I just assumed that it would suck up everything up till you hit {li'u} and if doing so hit something ungrammatical (like {mi prami do cusku}) then it would be ungrammatical and you just should have used {lo'u}.

Well that's one that I'll have to keep in mind in the future.

Stela Selckiku

unread,
Oct 15, 2010, 8:00:48 PM10/15/10
to loj...@googlegroups.com
On Fri, Oct 15, 2010 at 12:40 PM, Adam Lopresto <adamlo...@gmail.com> wrote:
>
> .i mi lu mi prami do cusku

The ONLY usage I can recall seeing of this type of sentence is when
experts are showing off their mastery of the obscurities! I can parse
them fine, ever since being put through that rite of passage myself,
but I doubt the language would be much poorer without them. OTOH
they'd be very useful on the forum I've been considering making
(perhaps called "lujyjbo") for intentionally convoluted and obscure
Lojban! I thought it could be fun sport if we gathered somewhere far
away from the confused newbies and made Lojban dance with strange
elisions and deep embedding and the layered SE shell game, etc. Even
for that purpose, though, I'm not sure there's anything especially
worthwhile about this quirk! It's not like Lojban has any shortage of
obscurations to play with.

mi'e .telselkik. mu'o

Luke Bergen

unread,
Oct 15, 2010, 11:50:47 PM10/15/10
to loj...@googlegroups.com
oh wow, that sounds like a potentially really fun forum.

And what is this "rite of passage" of which you speak, and how would one go about attempting to survive said rite?


--

Stela Selckiku

unread,
Oct 16, 2010, 1:34:06 AM10/16/10
to loj...@googlegroups.com
On Fri, Oct 15, 2010 at 11:50 PM, Luke Bergen <lukea...@gmail.com> wrote:
> oh wow, that sounds like a potentially really fun
> forum.

http://groups.google.com/group/lujyjbo <-- there it is

> And what is this "rite of passage" of which you
> speak, and how would one go about attempting
> to survive said rite?

I just mean when someone who knows that li'u & toi are elidable says
sentences that elide them, just to watch someone who doesn't know that
yet be confused! I think you've just survived it. Now you can go
subject new people to the knowledge. But, um, maybe you shouldn't!

Jonathan Jones

unread,
Oct 16, 2010, 2:02:11 AM10/16/10
to loj...@googlegroups.com
Fortunately for my sanity, I have no interest whatsoever in that group.

Still, as just a simple example of annoying practices (you call it amusing art, I call it annoying practice, agree to disagree, Stela), someone should post in .lokadin.'s evil no space writing style.

.imiCUSkubaula.lojban. , to provide a rather simple example, although to truly do it justice, a more complex jufra should be used.

symuyn

unread,
Oct 16, 2010, 12:46:00 PM10/16/10
to lojban
On Oct 14, 4:42 pm, ".alyn.post." <alyn.p...@lodockikumazvati.org>
wrote:
Thanks for the replies, everyone.

***

In reply to Mr. Post, saving and using continuations is a very
interesting idea, but unfortunately, I don't see how it would be
practically usable when it comes to editing near the beginning—or even
middle!—of the document. Hypothetically, if you have a long document,
editing it even in the middle would take a long time to process for
each re-parse.

The two points that you give at the end to ameliorate continuations'
problems are interesting but very difficult, as far as I can tell.
Perhaps you can give some answers—

Providing feedback during parsing of text downstream of the editing is
impossible as far as I can tell—every PEG library I know—including the
ones that I've written—is a sealed black box: once you plug something
in, you must wait until it finishes getting the result out.

Comparing parse trees and stopping re-parsing when they're
sufficiently similar is risky, if there is no way to guarantee that
the syntax tree is exactly the same all the way to the end *without re-
parsing the whole thing anyway*. As far as I can tell, just because a
new parse tree starts to look similar to the original tree, the new
parse tree is not necessarily identical till the end. (Or is that
actually a property of the Lojban grammar? If it is, only then should
early stopping by comparison be used.)

However, continuations *are* indeed the only way that I can now think
of to implement practically such a text editor *without* requiring
LIhU, etc. to be at the end of multi-utterance texts.

***

In reply to Mr. Jones, I'd love to hear from xorxes, and other people,
if eliding LIhU, etc. is "looked down upon". That would definitely
shift my deciding toward prohibiting eliding LIhU, etc. But, if LIhU,
etc. are elided *a lot*, then the text editor ought to be able to
handle that.

The hypothetical text editor really is hypothetical. I have a
rudimentary Lojban parser written in a parser library that I wrote
(and need to finish, bleh), but I am currently still planning it.

***

In reply to Mr. Lopresto, my earlier second question doesn't dispute
that eliding LIhU, etc. is grammatically valid. The second question
asks if, hypothetically, it is *worth* the sacrifice to make them non-
elidable to be able to isolate utterances and re-parse only them
during editing.

You gave an interesting suggestion: "it seems completely fair
(hypothetically) to make a parser that exhibits sub-optimal
performance in those unusual cases (reparsing all of the above bridi,
instead of just the {mi prami do} part, for instance)." Unfortunately,
as far as I can tell it is not practically possible to be able to
parse with elidable LIhU, etc. and have sub-optimal performance in
*only* those unusual cases. The person making the hypothetical text
editor can't switch between a fast parser that can't elide them to a
slow parser that can, because in order to detect when LIhU, etc. has
been elided, you need to parse with the slow parser anyway!

***

I'm still struggling over if continuations vs. non-elidable text
terminators are worth it. Lojbanic robustness vs. performance
robustness.

Using continuations—the only apparent way—would make the text editor a
lot more complex. But, is it worth making a text editor *at all* if it
can't parse *all* Lojban, including Lojban that validly elides LIhU,
etc.?

I just don't know.

(There is a third option, of course: not implementing LU…LIhU, TU…
TUhE, etc. at all! But that's definitely unacceptable.)

Jorge Llambías

unread,
Oct 16, 2010, 1:20:02 PM10/16/10
to loj...@googlegroups.com
On Sat, Oct 16, 2010 at 1:46 PM, symuyn <rbys...@gmail.com> wrote:
>
> In reply to Mr. Jones, I'd love to hear from xorxes, and other people,
> if eliding LIhU, etc. is "looked down upon". That would definitely
> shift my deciding toward prohibiting eliding LIhU, etc. But, if LIhU,
> etc. are elided *a lot*, then the text editor ought to be able to
> handle that.

They are not elided a lot, if only because they need very special
contexts in order to be elidable. I would make them non-elidable if it
was my choice.

> But, is it worth making a text editor *at all* if it
> can't parse *all* Lojban, including Lojban that validly elides LIhU,
> etc.?

I think so, yes. But, there may be another problem you may also have
to deal with. Things like "zo ni'o" and ".i bu" are not sentence
separators. So, will they be handled by the lexer? But not every "zo
ni'o" sequence is a sumti. In "zo zo ni'o", the "ni'o" does start a
new paragraph. You may have to parse the whole thing anyway before you
can tell whether a given "ni'o", ".i", "lu", etc. is "active" or not.

mu'o mi'e xorxes

.alyn.post.

unread,
Oct 16, 2010, 2:20:23 PM10/16/10
to loj...@googlegroups.com
On Sat, Oct 16, 2010 at 12:02:11AM -0600, Jonathan Jones wrote:
> .imiCUSkubaula.lojban. , to provide a rather simple example, although to
> truly do it justice, a more complex jufra should be used.
>

I'm sorry to report, after reviewing Chapter 3 of the CLL a couple
days ago, I could read that with little trouble. :-(

Krzysztof Sobolewski

unread,
Oct 16, 2010, 2:38:13 PM10/16/10
to loj...@googlegroups.com
Dnia sobota, 16 października 2010 o 08:02:11 Jonathan Jones napisał(a):

> Still, as just a simple example of annoying practices (you call it amusing art, I call it annoying practice, agree to disagree, Stela), someone should post in .lokadin.'s evil no space writing style.
>
> .imiCUSkubaula.lojban. , to provide a rather simple example, although to truly do it justice, a more complex jufra should be used.

But but but - isn't it how you are suppoed to understand Lojban from hearing? A constant stream of phonemes, some stressed, some not? Does that mean that speaking Lojban is an annoying pratice? ;)
--
Ecce Jezuch
"In the darkest hole, you'd be well advised
Not to plan my funeral before the body dies" - J. Cantrell

signature.asc

.alyn.post.

unread,
Oct 16, 2010, 2:57:01 PM10/16/10
to loj...@googlegroups.com
On Sat, Oct 16, 2010 at 09:46:00AM -0700, symuyn wrote:
> In reply to Mr. Post, saving and using continuations is a very
> interesting idea, but unfortunately, I don't see how it would be
> practically usable when it comes to editing near the beginning—or even
> middle!—of the document. Hypothetically, if you have a long document,
> editing it even in the middle would take a long time to process for
> each re-parse.
>
> The two points that you give at the end to ameliorate continuations'
> problems are interesting but very difficult, as far as I can tell.
> Perhaps you can give some answers—
>
> Providing feedback during parsing of text downstream of the editing is
> impossible as far as I can tell—every PEG library I know—including the
> ones that I've written—is a sealed black box: once you plug something
> in, you must wait until it finishes getting the result out.
>

The PEG parser I'm using requires you to write a generator for token
input, which is the first place I'd try putting continuations:

http://gazette.call-cc.org/issues/5.html

Meaning, I'd manage my continuations on the input side, rather than
the output side, because you're right--stuff pops out fully formed.

I suspect I'd have to hack at the parser some after this to make
continuations work as I expected, but I already expected I'd be
learning and fixing this particular parser anyway.

If you can save to and seek back to the input position for a
particular parse, save the state variables of the parser, and save
the syntax tree, it really should be possible.

While I have extensive experience with parsers, I have very little
experience with PEG parsers, so I accept that something about these
parsers may make that difficult. I wouldn't try to do something
like this with a recursive descent parser, because it saves state on
the stack. I might be motivated enough to convert a recursive descent
parser into a continuation passing style parser, which would allow
me to save the stack-based representation of the parse on the heap
if I needed to create a continuation.


> Comparing parse trees and stopping re-parsing when they're
> sufficiently similar is risky, if there is no way to guarantee that
> the syntax tree is exactly the same all the way to the end *without re-
> parsing the whole thing anyway*. As far as I can tell, just because a
> new parse tree starts to look similar to the original tree, the new
> parse tree is not necessarily identical till the end. (Or is that
> actually a property of the Lojban grammar? If it is, only then should
> early stopping by comparison be used.)
>

You're correct. Sufficiently similar is a heuristic that won't work
for all cases. I was suggesting that this trade-off was better than
the other suggested trade-offs for solving this problem. As you're
the one writing this thing, you get to decide which trade-off you
want to deal with. ;-)

What I was thinking is the the parser would run as a separate
thread, and the parse tree in the main thread would contain in it a
marshall object at the current parse location. Occassionally this
marshall object would receive more of the parse tree and a new
marshall object, replacing the old marshall object with the new bit
of parse tree and the place it was still working.

I wasn't thinking of the "all-or-nothing" property of PEG parsers,
as this idea does require the PEG parser to check back in with the
caller from time to time.


> However, continuations *are* indeed the only way that I can now think
> of to implement practically such a text editor *without* requiring
> LIhU, etc. to be at the end of multi-utterance texts.
>

I don't personally elide LIhU, LUhU, TOI, &c myself.

I know that Eclipse can parse python (and presumably other languages)
in its editor. I also know that this parser is not identical to the
way that python parsers itself, as my coworkers have complained to
me that Eclipse flagged a perfectly valid python construct as
invalid when in fact python did something quite useful with the same
construct.

This doesn't stop people from using Eclipse to write python, and
serves as prior art on how people cope (and programs don't cope) with
this sort of thing.

In our case we add little tokens in the code that Eclipse recognizes
but python doesn't, so Eclipse will shut up and python will work.

I am curious to hear how this project goes for you. What platform
are you targeting? What language are you using? What PEG parser
are you trying? No one has tried doing anything like this with
Lojban before, which makes me quite excited.

Jonathan Jones

unread,
Oct 16, 2010, 5:59:18 PM10/16/10
to loj...@googlegroups.com
I did say it was a rather simple example....

Jonathan Jones

unread,
Oct 16, 2010, 6:00:19 PM10/16/10
to loj...@googlegroups.com
That was lokadin's argument, IIRC. My answer is no, but writing conventions and speech are not the same thing.

Jonathan Jones

unread,
Oct 16, 2010, 6:04:21 PM10/16/10
to loj...@googlegroups.com
Does it have to use PEG? .camxes. uses a RATS! parser, maybe that is less of a "black box" and would work better for this? I'm only guessing, as I don't know thing 2 about parsers in general.

.alyn.post.

unread,
Oct 16, 2010, 6:53:51 PM10/16/10
to loj...@googlegroups.com
On Sat, Oct 16, 2010 at 04:04:21PM -0600, Jonathan Jones wrote:
> Does it have to use PEG? .camxes. uses a RATS! parser, maybe that is less
> of a "black box" and would work better for this? I'm only guessing, as I
> don't know thing 2 about parsers in general.
>

I believe RATS! is a particular implementation of PEG parsing. So
PEG parsing is a parsing technique, and Rats! is one of the
implementations of that technique.

Robin talks about this issue as it pertains to Lojban with some depth
here:

http://www.digitalkingdom.org/~rlpowell/hobbies/lojban/grammar/

Bryan Ford's Master's Thesis on Packrat Parsers is suprisingly
accessible, including what makes PG (Parsing Grammars) more
expressive than CFG's (Context-free Grammars):

http://pdos.csail.mit.edu/~baford/packrat/thesis/

He does make the assumption you're familiar with parsers in general.
The Abstract on that page is a great summary of what trade-offs you
make with packrat parsers vs Yacc-style parsers.

The book I learned parsing from is the "Dragon Book": _Compilers:
Principles, Techniques, and Tools_. There may be a better book these
days, but I won't be surprised if there isn't.

Jonathan Jones

unread,
Oct 16, 2010, 7:13:08 PM10/16/10
to loj...@googlegroups.com
Ah well. Can't say I didn't try.
Reply all
Reply to author
Forward
0 new messages