zoi bug in camxes?

.alyn.post.

Jan 24, 2011, 11:25:04 AM

to Lojban List

Robin,

According to example 10.3 in the CLL, the following statement is not
grammatical when written[1]:

mi djuno fi le valsi po'u zoi gy. gyrations .gy.

camxes parses this anyway. Is this a bug in camxes? Should zoi
handling be detecting whether the zoi-open word is a substring of
zoi-word rather than testing for strict equality?

-Alan

1: http://dag.github.com/cll/19/10/
--
.i ko djuno fi le do sevzi

Jorge Llambías

Jan 24, 2011, 5:47:27 PM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 1:25 PM, .alyn.post.
<alyn...@lodockikumazvati.org> wrote:
>
> According to example 10.3 in the CLL, the following statement is not
> grammatical when written[1]:
>
> mi djuno fi le valsi po'u zoi gy. gyrations .gy.
>
> camxes parses this anyway. Is this a bug in camxes?

If it's a bug, it's intentional. camxes will break the enclosed text
into words as much as possible, and "gyrations" is a valid cmevla and
cannot break into the two words "gy rations", so there is no way that
the "gy" of "gyrations" can be taken as the closing ZOI delimiter
word. In that position "gy" is not a possible word.

If the example had instead been "zoi gy gyrate gy", then yes, it
would break, because "gyrate" is three cmavo "gy ra te". That would be
a better example for what CLL is saying there.
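The segmentation argument can be illustrated with a toy sketch. It assumes a drastically simplified morphology (a string ending in a consonant counts as one cmevla; anything else is greedily tiled with short cmavo-shaped pieces) and is not how camxes is implemented. The name `toy_words` and the regex are hypothetical, chosen only for this illustration.

```python
import re

CONSONANTS = "bcdfgjklmnprstvxz"

# Assumption: a "cmavo" here is an optional consonant, one vowel (or y),
# and optionally one more vowel. Real Lojban morphology is much richer.
CMAVO = re.compile(rf"[{CONSONANTS}]?[aeiouy](?:'?[aeiou])?")


def toy_words(s):
    """Split s into toy 'Lojban words' under the simplifications above."""
    if not s:
        return []
    if s[-1] in CONSONANTS:
        # Consonant-final: treat the whole string as a single cmevla.
        return [s]
    words, pos = [], 0
    while pos < len(s):
        m = CMAVO.match(s, pos)
        if m is None:
            # Cannot be tiled with toy cmavo; give up and keep it whole.
            return [s]
        words.append(m.group())
        pos = m.end()
    return words


# The closing delimiter "gy" can only be found if it surfaces as a word.
delimiter_in_gyrate = "gy" in toy_words("gyrate")
delimiter_in_gyrations = "gy" in toy_words("gyrations")
```

Under these assumptions "gyrate" yields the delimiter as its first word, while "gyrations" stays a single word and never exposes "gy".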

> Should zoi
> handling be detecting whether the zoi-open word is a substring of
> zoi-word rather than testing for strict equality?

The delimiters are supposed to be valid Lojban words, not just any
phoneme string.

mu'o mi'e xorxes

Robin Lee Powell

Jan 24, 2011, 8:00:06 PM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 09:25:04AM -0700, .alyn.post. wrote:
> Robin,
>
> According to example 10.3 in the CLL, the following statement is not
> grammatical when written[1]:
>
> mi djuno fi le valsi po'u zoi gy. gyrations .gy.

I disagree with the CLL here; there is no pause or space, so I don't
see why this should fail.

-Robin

--
http://singinst.org/ : Our last, best hope for a fantastic future.
Lojban (http://www.lojban.org/): The language in which "this parrot
is dead" is "ti poi spitaki cu morsi", but "this sentence is false"
is "na nei". My personal page: http://www.digitalkingdom.org/rlp/

Robin Lee Powell

Jan 24, 2011, 8:03:10 PM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 07:47:27PM -0300, Jorge Llambías wrote:
> If the example had instead been "zoi gy gyrate gy", then yes,
> it would break, because "gyrate" is three cmavo "gy ra te". That
> would be a better example for what CLL is saying there.

Huh.

zoi gy gyrate gy fails in camxes; that seems like a bug (in camxes)
to me. It seems to me that the final zoi delimiter must have a
pause on both ends. But I haven't read the relevant CLL bit in
quite some time; what does it say about that?

Certainly for

zoi gy. gyrations .gy.

to "work" but

zoi gy gyrate gy

to "not work" is a bug in camxes by my standards; it needs to be one
or the other. xorxes is correct as to the *reason* it's like that,
but that doesn't, IMO, make it OK.

Jorge Llambías

Jan 24, 2011, 9:20:40 PM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 10:03 PM, Robin Lee Powell
<rlpo...@digitalkingdom.org> wrote:
>
> zoi gy gyrate gy fails in camxes; that seems like a bug (in camxes)
> to me. It seems to me that the final zoi delimiter must have a
> pause on both ends. But I haven't read the relevant CLL bit in
> quite some time; what does it say about that?

CLL: "The cmavo “zoi” (of selma'o ZOI) is a quotation mark for quoting
non-Lojban text. Its syntax is “zoi X. text .X”, where X is a Lojban
word (called the delimiting word) which is separated from the quoted
text by pauses, and which is not found in the written text or spoken
phoneme stream."

It doesn't say that the first X need be preceded by a pause, nor that
the final X need be followed by a pause.

But even the pauses that CLL does mention aren't always needed. For
example camxes probably approves of "zoidadida".

> Certainly for
>
> zoi gy. gyrations .gy.
>
> to "work" but
>
> zoi gy gyrate gy
>
> to "not work" is a bug in camxes by my standards; it needs to be one
> or the other.

Why? From a Lojbanic perspective "gyrations" is a single word, while
"gyrate" is three words, so there doesn't seem to be a reason (unless
you know English, but the Lojban parser doesn't) to treat "gyrate" as one.

.alyn.post.

Jan 24, 2011, 10:57:53 PM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 11:20:40PM -0300, Jorge Llambías wrote:
> On Mon, Jan 24, 2011 at 10:03 PM, Robin Lee Powell
> <rlpo...@digitalkingdom.org> wrote:
> >
> > zoi gy gyrate gy fails in camxes; that seems like a bug (in camxes)
> > to me. It seems to me that the final zoi delimiter must have a
> > pause on both ends. But I haven't read the relevant CLL bit in
> > quite some time; what does it say about that?
>
> CLL: "The cmavo “zoi” (of selma'o ZOI) is a quotation mark for quoting
> non-Lojban text. Its syntax is “zoi X. text .X”, where X is a Lojban
> word (called the delimiting word) which is separated from the quoted
> text by pauses, and which is not found in the written text or spoken
> phoneme stream."
>
> It doesn't say that the first X need be preceded by a pause, nor that
> the final X need be followed by a pause.
>
> But even the pauses that CLL does mention aren't always needed. For
> example camxes probably approves of "zoidadida".
>

I always assumed that this description was describing (in a PEG
grammar with an '=' operator I'm inventing for this purpose):

zoi <- zoi-open=any-lojban-word pause (!(pause? zoi-open) .)* pause zoi-open

Namely, that we read an X as any-lojban-word, store the value, then
we read a *character at a time* until we find another X. In this case
"quoted text" is a character stream, not itself broken into discrete
words and therefore not subject to differentiation between gyrate and
gyrations.

I believe this description makes the CLL consistent with itself. It
is the only way I make sense of the example given. I'm not suggesting
this is the behavior the PEG grammar should have, though I certainly
wonder if this is what is being described in the text above.
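A minimal sketch of that character-at-a-time reading, under stated assumptions: a pause is taken to be any run of spaces or dots, the delimiter is any run of non-pause characters, and "zoi" is always separated from the delimiter. The function name `parse_zoi_charwise` is hypothetical; this mirrors the invented PEG rule above and not camxes itself.

```python
import re

# Assumption: a "pause" is one or more spaces or dots.
PAUSE = re.compile(r"[ .]+")


def parse_zoi_charwise(text):
    """Return the quoted text, or None if ungrammatical under this reading."""
    opening = re.match(r"zoi[ .]+([^ .]+)", text)
    if opening is None:
        return None
    delim = opening.group(1)

    # A pause must follow the opening delimiter.
    pause = PAUSE.match(text, opening.end())
    if pause is None:
        return None
    start = pos = pause.end()

    # (!(pause? zoi-open) .)* : advance one character at a time until the
    # delimiter string appears, with or without a pause in front of it.
    while pos < len(text):
        pause = PAUSE.match(text, pos)
        after_pause = pause.end() if pause else pos
        if text.startswith(delim, after_pause):
            break
        pos += 1

    # pause zoi-open : the closing delimiter must be preceded by a pause.
    pause = PAUSE.match(text, pos)
    if pause and text.startswith(delim, pause.end()):
        return text[start:pos]
    return None
```

With this reading, `parse_zoi_charwise("zoi gy. gyrations .gy.")` fails because the string "gy" shows up inside the quoted text without a pause in front of it, which is the outcome the CLL example describes.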

-Alan

Lindar

Jan 25, 2011, 1:14:44 AM

to loj...@googlegroups.com

This -is- a bug, but not something we can fix. Camxes doesn't differentiate in how the delimited material is pronounced.

Try plugging in {mi kelci la'o vo. Left4Dead .vo.} and it'll break there too. It doesn't know that "gy" in "gyration" is pronounced differently (and therefore doesn't break the rule when spoken) from "gy" as {gy}, and it doesn't read the dots (or strips them before parsing anyway), so it naturally would cut it right there and then see "dead" hanging out there, which is an invalid word.

Robin Lee Powell

Jan 25, 2011, 2:07:30 AM

to loj...@googlegroups.com

That really has nothing at all to do with the point I was making,
which is that it shouldn't be processing stuff inside the zoi quote
at all. "gyrate" or "gyrations" or whatever shouldn't be able to be
matched as the delimiter because "gy" isn't *separated out as a
word* there. Since we're talking about text and not speaking, this
means: it has no space or . after it.

Robin Lee Powell

Jan 25, 2011, 2:31:02 AM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 11:20:40PM -0300, Jorge Llambías wrote:

> On Mon, Jan 24, 2011 at 10:03 PM, Robin Lee Powell
> <rlpo...@digitalkingdom.org> wrote:
> >
> > zoi gy gyrate gy fails in camxes; that seems like a bug (in
> > camxes) to me. It seems to me that the final zoi delimiter must
> > have a pause on both ends. But I haven't read the relevant CLL
> > bit in quite some time; what does it say about that?
>
> CLL: "The cmavo “zoi” (of selma'o ZOI) is a quotation mark for
> quoting non-Lojban text. Its syntax is “zoi X. text .X”, where X
> is a Lojban word (called the delimiting word) which is separated
> from the quoted text by pauses, and which is not found in the
> written text or spoken phoneme stream."
>
> It doesn't say that the first X need be preceded by a pause, nor
> that the final X need be followed by a pause.
>
> But even the pauses that CLL does mention aren't always needed.
> For example camxes probably approves of "zoidadida".

Yeah, I don't think that's right at all; treating what's inside zoi
as Lojbanic text and breaking it up into Lojbanic words just seems a
bad plan to me. Fraught with peril.

Hardly a low-hanging fruit, though.

Jorge Llambías

Jan 25, 2011, 7:01:34 AM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 4:07 AM, Robin Lee Powell
<rlpo...@digitalkingdom.org> wrote:
>
> it shouldn't be processing stuff inside the zoi quote
> at all. "gyrate" or "gyrations" or whatever shouldn't be able to be
> matched as the delimiter because "gy" isn't *separated out as a
> word* there.

The text inside the quote "zoi gy gy" is the empty string. Following
this with the Lojban words "ra", "te", "gy" makes no difference, there
is no "processing stuff inside the quote" going on here since there is
no stuff inside the quote to process, other than a space.

> Since we're talking about text and not speaking, this
> means: it has no space or . after it.

That's not how Lojban words are defined. Space is always allowed
between written words, but not always required.

Jorge Llambías

Jan 25, 2011, 7:13:55 AM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 12:57 AM, .alyn.post.
<alyn...@lodockikumazvati.org> wrote:
>>
>> CLL: "The cmavo “zoi” (of selma'o ZOI) is a quotation mark for quoting
>> non-Lojban text. Its syntax is “zoi X. text .X”, where X is a Lojban
>> word (called the delimiting word) which is separated from the quoted
>> text by pauses, and which is not found in the written text or spoken
>> phoneme stream."
>

> I always assumed that this description was describing (in a PEG
> grammar with an '=' operator I'm inventing for this purpose):
>
> zoi <- zoi-open=any-lojban-word pause (!(pause? zoi-open) .)* pause zoi-open
>
> Namely, that we read an X as any-lojban-word, store the value, then
> we read a *character at a time* until we find another X. In this case
> "quoted text" is a character stream, not itself broken into discrete
> words and therefore not subject to differentiation between gyrate and
> gyrations.
>
> I believe this description makes the CLL consistent with itself. It
> is the only way I make sense of the example given. I'm not suggesting
> this is the behavior the PEG grammar should have, though I certainly
> wonder if this is what is being described in the text above.

No, that doesn't agree with the CLL requirement of a pause in front of
the second delimiter, because you are disallowing X even in places not
preceded by a pause.

.alyn.post.

Jan 25, 2011, 7:31:16 AM

to loj...@googlegroups.com

Right. zoi-quoted text that contains X not preceded by a pause
is ungrammatical. The opening message in this thread cited the
relevant section of the CLL, which describes exactly that situation
as ungrammatical.

There are two clauses there, the second being "and which is not
found in the written text". The intention of the code above is
that it does enforce a pause in front of the second delimiter
(that is the 'pause' before the final 'zoi-open'), but that it
also doesn't permit the literal string identified by zoi-open
to appear in the intervening text, with or without a pause: with,
in order to succeed in matching the ending delimiter; without, to
enforce the extra CLL requirement that it also not appear in the
quoted text.

To describe it another way: match any Lojban word, then match a
pause, then try to match the same Lojban word you just matched,
moving forward a character at a time until the match succeeds.
Then, since you must, assert that you had a pause in front of
the second delimiter.

It's convoluted, but I think your statement that I'm disallowing
X even in places not preceded by a pause is indeed a CLL
requirement. Both by implication of the example given and because
of the text I specifically quoted above.

I'm not suggesting this behavior is what we want, but I do still
believe it is what the CLL describes.

.alyn.post.

Jan 25, 2011, 8:48:46 AM

to loj...@googlegroups.com

On Mon, Jan 24, 2011 at 11:20:40PM -0300, Jorge Llambías wrote:

I might not be able to forgive you, xorxes, for making me download
and read the source code to the official parser. Looking at it, I
a) think we can do better and b) think I better understand why the
CLL is confusingly worded.

In the technical description of the parser, the following statement
is made:

a. If the Lojban word "zoi" (selma'o ZOI) is identified, take the
following Lojban word (which should be end delimited with a pause for
separation from the following non-Lojban text) as an opening delimiter.
Treat all text following that delimiter, until that delimiter recurs
*after a pause*, as grammatically a single token (labelled 'anything_699'
in this grammar). There is no need for processing within this text
except as necessary to find the closing delimiter.

This seems pretty clear-cut to me, but it has almost nothing to do
with the implementation, which contradicts the opening example in
this thread in how it processes anything_699.

(BTW, I'm not clear as to whether a pause is both space and '.', or
whether it is only '.'. Help?)

The implementation is contained in filter.c, in particular the
following lines:

    case ZOI_START_MODE:
        tok = lex();
        if (isEnd(tok)) return tok;
        tok->type = any_word_698;
        mode = ZOI_STRING_MODE;
        delim = tok;
        return tok;
    case ZOI_STRING_MODE:
        result = newtoken();
        result->type = anything_699;
        for (;;) {
            tok = lex();
            if (isEnd(tok)) return tok;
            if (strcmp(tok->text, delim->text) == 0) break;
            tok->type = -1;
            add(result, tok);
        }
        mode = ZOI_END_MODE;
        return result;
    case ZOI_END_MODE:
        /* note: token has already been read */
        tok->type = any_word_698;
        mode = NORMAL_MODE;
        return tok;

If you follow lex(), you find getword(), which is the low-level
tokenizer in the parser. It reads ' ' or '.' delimited strings,
which means it considers "pano" a single token.

As a result, it behaves much like camxes does with gyration, but
I believe it would differ from camxes in parsing "gyrate", which
at this level of processing it would insist on treating as a single
token rather than three Lojban words.

In no case does it go looking for the delimiter inside individual
tokens, a behavior which camxes matches.

The code has the effect of treating everything between the delimiter
words as a single token, but misses edge cases because of the way
the tokenizer works.
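The token-level behaviour described above can be approximated in a few lines. This is only a sketch: it assumes, per the description of getword(), that tokens are runs of characters delimited by ' ' or '.', and it reproduces just the whole-token equality test of the strcmp loop. The name `parse_zoi_tokenwise` is hypothetical and nothing here is taken from the parser beyond the quoted excerpt.

```python
import re


def parse_zoi_tokenwise(text):
    """Return the list of quoted tokens, or None if no closing delimiter."""
    # Assumption: tokens are separated by runs of spaces and/or dots.
    tokens = [t for t in re.split(r"[ .]+", text) if t]
    if len(tokens) < 2 or tokens[0] != "zoi":
        return None
    delim = tokens[1]

    quoted = []
    for tok in tokens[2:]:
        # Whole-token equality only, like strcmp(tok->text, delim->text).
        if tok == delim:
            return quoted
        quoted.append(tok)
    return None  # input ended before the delimiter recurred
```

Because the comparison is against whole tokens, both "zoi gy. gyrations .gy." and "zoi gy gyrate gy" are accepted by this sketch, with "gyrations" and "gyrate" each kept as one quoted token.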

-Alan

Robin Lee Powell

Jan 25, 2011, 12:19:57 PM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 09:01:34AM -0300, Jorge Llambías wrote:
> On Tue, Jan 25, 2011 at 4:07 AM, Robin Lee Powell
> <rlpo...@digitalkingdom.org> wrote:
> >
> > it shouldn't be processing stuff inside the zoi quote at all.
> > "gyrate" or "gyrations" or whatever shouldn't be able to be
> > matched as the delimiter because "gy" isn't *separated out as a
> > word* there.
>
> The text inside the quote "zoi gy gy" is the empty string.
> Following this with the Lojban words "ra", "te", "gy" makes no
> difference, there is no "processing stuff inside the quote" going
> on here since there is no stuff inside the quote to process, other
> than a space.

I understand what you're saying, I simply disagree that zoi should
work that way; the ending particle should have to be completely
delimited.

I don't actually *care* all that much, it's just my feeling on the
matter.

.alyn.post.

Jan 25, 2011, 12:36:07 PM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 09:19:57AM -0800, Robin Lee Powell wrote:
> On Tue, Jan 25, 2011 at 09:01:34AM -0300, Jorge Llambías wrote:
> > On Tue, Jan 25, 2011 at 4:07 AM, Robin Lee Powell
> > <rlpo...@digitalkingdom.org> wrote:
> > >
> > > it shouldn't be processing stuff inside the zoi quote at all.
> > > "gyrate" or "gyrations" or whatever shouldn't be able to be
> > > matched as the delimiter because "gy" isn't *separated out as a
> > > word* there.
> >
> > The text inside the quote "zoi gy gy" is the empty string.
> > Following this with the Lojban words "ra", "te", "gy" makes no
> > difference, there is no "processing stuff inside the quote" going
> > on here since there is no stuff inside the quote to process, other
> > than a space.
>
> I understand what you're saying, I simply disagree that zoi should
> work that way; the ending particle should have to be completely
> delimited.
>
> I don't actually *care* all that much, it's just my feeling on the
> matter.
>

From the conversation, I can summarize three separate proposals:

1) leave the PEG grammar alone and correct the CLL to describe the
way this grammar is behaving.
2) replace the rule for zoi-word to match non-lojban-word rather
than any-word, so 'gyrate' won't be divided into three words,
satisfying Robin's consistency argument. We'll still need to
update the CLL for the behavior, I think.
3) Replace the PEG grammar with something that reads the stuff
between the ZOI delimiter a character at a time. Either require
a pause before the final delimiter or not.

I like 2, 1, and then 3 in that order.

3 I don't like because the rest of the grammar doesn't really work
that way. The grammar is defined by token streams and the
morphology file handles composing the input stream into tokens.
3 is not completely unprecedented, however: FAhO does something
similar.

2 I like because of the argument Robin raised about
consistency. I think it is surprising that gyrate is invalid
but gyration is valid, and I don't think that surprise is a useful
feature. I can't think of a reason to dislike 2, please help me
with that.

1 has the advantage of working that way right now. It also has the
advantage of preferring Lojbanic text to non-Lojbanic text. ZOI
isn't special from that perspective; the parser does what the parser
does, ZOI or not.

Will you give me your preference ordering? I'd like to know whether
I should:

a) update the CLL errata.
b) update the PEG grammar.
c) raise an issue with the BPFK to make a decision.

I'll use the preference ordering to make a choice.

Jorge Llambías

Jan 25, 2011, 5:13:01 PM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 2:36 PM, .alyn.post.
<alyn...@lodockikumazvati.org> wrote:
>
> From the conversation, I can summarize three separate proposals:
>
> 1) leave the PEG grammar alone and correct the CLL to describe the
> way this grammar is behaving.

This is currently my preference.

> 2) replace the rule for zoi-word to match non-lojban-word rather
> than any-word, so 'gyrate' won't be divided into three words,
> satisfying Robin's consistency argument. We'll still need to
> update the CLL for the behavior, I think.

Notice that "gyrate" does NOT match "non-lojban-word" (and neither
does "gyrations" for that matter). You probably mean something else,
some new rule like "string-of-phonemes-without-space <- non-space+".
This also relates to the ZOhOI proposal.

Yes, CLL needs updating in any case. Robin requires the final
delimiter to be followed by a pause, which CLL does not.

> 3) Replace the PEG grammar with something that reads the stuff
> between the ZOI delimiter a character at a time. Either require
> a pause before the final delimiter or not.
>
> I like 2, 1, and then 3 in that order.

1, 2, 3 for me.

> 2 I like because of the argument Robin raised about
> consistency. I think it is surprising that gyrate is invalid
> but gyration is valid, and I don't think that surprise is a useful
> feature. I can't think of a reason to dislike 2, please help me
> with that.

But wouldn't "zoi mi miklama mi" as a valid quote be surprising? Are
you equally tempted to read "miklama" as a single word as you are to
read "gyrate" as one word, or are you just thinking English?

.alyn.post.

Jan 25, 2011, 5:14:51 PM

to loj...@googlegroups.com

Excellent point, I do see miklama as two Lojban words.

.alyn.post.

Jan 25, 2011, 5:25:14 PM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 07:13:01PM -0300, Jorge Llambías wrote:
> On Tue, Jan 25, 2011 at 2:36 PM, .alyn.post.
> <alyn...@lodockikumazvati.org> wrote:
> >
> > From the conversation, I can summarize three separate proposals:
> >
> > 1) leave the PEG grammar alone and correct the CLL to describe the
> > � way this grammar is behaving.
>
> This is currently my preference.
>
> > 2) replace the rule for zoi-word to match non-lojban-word rather
> > � than any-word, so 'gyrate' won't be divided into three words,
> > � satisfying Robin's consistency argument. �We'll still need to
> > � update the CLL for the behavior, I think.
>
> Notice that "gyrate" does NOT match "non-lojban-word" (and neither
> does "gyrations" for that matter). You probably mean something else,
> some new rule like "string-of-phonemes-without-space <- non-space+".
> This also relates to the ZOhOI proposal.
>

Ah, I'm proposing to remove the !lojban-word from the
non-lojban-word production, as it is redundant -- non-lojban-word is
used in an ordered choice operator for which the preceding rule is
lojban-word, which obviates the need for it in that case. That is
the only reference.
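The redundancy argument can be checked with a small sketch of PEG ordered choice. Everything here is a toy: `lojban_word` recognises only "klama" and "mi", `non_space_run` stands in for the body of the rule, and all the names are hypothetical and not taken from the camxes grammar.

```python
import re

NON_SPACE = re.compile(r"[^ ]+")


def lojban_word(s, pos):
    """Toy stand-in for lojban-word: knows only 'klama' and 'mi'."""
    for word in ("klama", "mi"):
        if s.startswith(word, pos):
            return pos + len(word)
    return None


def non_space_run(s, pos):
    """Match a run of non-space characters; return its end or None."""
    m = NON_SPACE.match(s, pos)
    return m.end() if m else None


def guarded(guard, rule):
    """PEG '!guard rule': fail if guard matches, otherwise try rule."""
    def parse(s, pos):
        return None if guard(s, pos) is not None else rule(s, pos)
    return parse


def choice(*alternatives):
    """PEG ordered choice: the first alternative that matches wins."""
    def parse(s, pos):
        for alt in alternatives:
            end = alt(s, pos)
            if end is not None:
                return end
        return None
    return parse


guarded_run = guarded(lojban_word, non_space_run)
with_guard = choice(lojban_word, guarded_run)       # current shape
without_guard = choice(lojban_word, non_space_run)  # proposed shape

SAMPLES = ["mi", "klama", "miklama", "gyrate", "Left4Dead", ""]
same_everywhere = all(with_guard(s, 0) == without_guard(s, 0) for s in SAMPLES)
```

Inside the ordered choice the two shapes agree on every sample, since the second alternative is only reached after lojban-word has already failed. Used on its own, though, the unguarded rule also matches "mi", so any other reference to it would see a change in behaviour.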

Jorge Llambías

Jan 25, 2011, 5:31:51 PM

to loj...@googlegroups.com

On Tue, Jan 25, 2011 at 7:25 PM, .alyn.post.
<alyn...@lodockikumazvati.org> wrote:
>
> Ah, I'm proposing to remove the !lojban-word from the
> non-lojban-word production, as it is redundant -- non-lojban-word is
> used in an ordered choice operator for which the preceding rule is
> lojban-word, which obviates the need for it in that case. That is
> the only reference.

In that case, I suggest changing the name of the rule to something
more descriptive. I find it confusing to think of "mi", "klama" and
"miklama" as "non-lojban-word"s.

.alyn.post.

Jan 26, 2011, 9:23:11 PM

to Lojban List

On Mon, Jan 24, 2011 at 09:25:04AM -0700, .alyn.post. wrote:

I've added a BPFK section asking for resolution on this issue:

http://www.lojban.org/tiki/tiki-index.php?page=BPFK+Section%3A+ZOI

-Alan
