According to example 10.3 in the CLL, the following statement is not
grammatical when written[1]:
mi djuno fi le valsi po'u zoi gy. gyrations .gy.
camxes parses this anyway. Is this a bug in camxes? Should zoi
handling be detecting whether the zoi-open word is a substring of
zoi-word rather than testing for strict equality?
-Alan
1: http://dag.github.com/cll/19/10/
--
.i ko djuno fi le do sevzi
If it's a bug, it's intentional. camxes will break the enclosed text
into words as much as possible, and "gyrations" is a valid cmevla and
cannot break into the two words "gy rations", so there is no way that
the "gy" of "gyrations" can be taken as the closing ZOI delimiter
word. In that position "gy" is not a possible word.
If the example had instead been "zoi gy gyrate gy", then yes it
would break, because "gyrate" is three cmavo "gy ra te". That would be
a better example for what CLL is saying there.
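
A toy sketch of that difference, in C. It is not camxes and does not implement real Lojban morphology. It assumes only two rules, enough for the two examples here: an unspaced chunk ending in a consonant is one cmevla, and otherwise a leading consonant+vowel pair (counting y as a vowel) splits off as a cmavo. All function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy rule: y counts as a vowel so that "gy" has the cmavo shape CV. */
static int toy_is_vowel(char c) {
    return c != '\0' && strchr("aeiouy", c) != NULL;
}

static int toy_is_consonant(char c) {
    return c >= 'a' && c <= 'z' && !toy_is_vowel(c);
}

/* Length of the first word in an unspaced chunk under the toy rules:
   a chunk ending in a consonant is a single cmevla; otherwise a leading
   consonant+vowel pair is split off as a cmavo. Returns 0 if neither
   rule applies (real morphology has many more cases). */
size_t toy_first_word_len(const char *chunk) {
    size_t n = strlen(chunk);
    if (n == 0) return 0;
    if (toy_is_consonant(chunk[n - 1])) return n;
    if (n >= 2 && toy_is_consonant(chunk[0]) && toy_is_vowel(chunk[1])) return 2;
    return 0;
}

/* Nonzero if the chunk begins with the delimiter as a whole word. */
int toy_starts_with_delimiter(const char *chunk, const char *delim) {
    size_t w = toy_first_word_len(chunk);
    return w == strlen(delim) && strncmp(chunk, delim, w) == 0;
}
```

Under these assumptions "gyrations" is one nine-letter word, so "gy" is never seen as a word, while "gyrate" starts with the word "gy" and so closes the quote immediately.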
> Should zoi
> handling be detecting whether the zoi-open word is a substring of
> zoi-word rather than testing for strict equality?
The delimiters are supposed to be valid Lojban words, not just any
phoneme string.
mu'o mi'e xorxes
I disagree with the CLL here; there is no pause or space, so I don't
see why this should fail.
-Robin
Huh.
zoi gy gyrate gy fails in camxes; that seems like a bug (in camxes)
to me. It seems to me that the final zoi delimiter must have a
pause on both ends. But I haven't read the relevant CLL bit in
quite some time; what does it say about that?
Certainly for
zoi gy. gyrations .gy.
to "work" but
zoi gy gyrate gy
to "not work" is a bug in camxes by my standards; it needs to be one
or the other. xorxes is correct as to the *reason* it's like that,
but that doesn't, IMO, make it OK.
CLL: "The cmavo “zoi” (of selma'o ZOI) is a quotation mark for quoting
non-Lojban text. Its syntax is “zoi X. text .X”, where X is a Lojban
word (called the delimiting word) which is separated from the quoted
text by pauses, and which is not found in the written text or spoken
phoneme stream."
It doesn't say that the first X need be preceded by a pause, nor that
the final X need be followed by a pause.
But even the pauses that CLL does mention aren't always needed. For
example camxes probably approves of "zoidadida".
> Certainly for
>
> zoi gy. gyrations .gy.
>
> to "work" but
>
> zoi gy gyrate gy
>
> to "not work" is a bug in camxes by my standards; it needs to be one
> or the other.
Why? From a Lojbanic perspective "gyrations" is a single word, while
"gyrate" is three words, so there doesn't seem to be a reason (unless
you know English, but the Lojban parser doesn't) to treat "gyrate" as one.
I always assumed that this description was describing (in a PEG
grammar with an '=' operator I'm inventing for this purpose):
zoi <- zoi-open=any-lojban-word pause (!(pause? zoi-open) .)* pause zoi-open
Namely, that we read an X as any-lojban-word, store the value, then
we read a *character at a time* until we find another X. In this case
"quoted text" is a character stream, not itself broken into discrete
words and therefore not subject to differentiation between "gyrate" and
"gyrations".
I believe this description makes the CLL consistent with itself. It
is the only way I can make sense of the example given. I'm not suggesting
this is the behavior the PEG grammar should have, though I certainly
wonder if this is what is being described in the text above.
-Alan
That really has nothing at all to do with the point I was making,
which is that it shouldn't be processing stuff inside the zoi quote
at all. "gyrate" or "gyrations" or whatever shouldn't be able to be
matched as the delimiter because "gy" isn't *separated out as a
word* there. Since we're talking about text and not speaking, this
means: it has no space or . after it.
Yeah, I don't think that's right at all; treating what's inside zoi
as Lojbanic text and breaking it up into Lojbanic words just seems a
bad plan to me. Fraught with peril.
Hardly a low-hanging fruit, though.
The text inside the quote "zoi gy gy" is the empty string. Following
this with the Lojban words "ra", "te", "gy" makes no difference, there
is no "processing stuff inside the quote" going on here since there is
no stuff inside the quote to process, other than a space.
> Since we're talking about text and not speaking, this
> means: it has no space or . after it.
That's not how Lojban words are defined. Space is always allowed
between written words, but not always required.
No, that doesn't agree with the CLL requirement of a pause in front of
the second delimiter, because you are disallowing X even in places not
preceded by a pause.
Right. zoi-quoted text that contains X not preceded by a pause
is ungrammatical. The opening message in this thread cited the
relevant section of the CLL, which describes exactly that situation
as being so.
There are two clauses there, the second being "and which is not
found in the written text". The intention of the code above is
to enforce a pause in front of the second delimiter (that is the
'pause' before the final 'zoi-open') and also to forbid the literal
string identified by zoi-open from appearing in the intervening
text, with or without a pause: with a pause, so that the ending
delimiter can be matched; without one, to enforce the extra CLL
requirement that it not appear in the quoted text.
To describe it another way: match any Lojban word, then match a
pause, then try to match the same Lojban word you just matched,
moving forward a character at a time until the match succeeds.
Then, since you must, assert that you had a pause in front of
the second delimiter.
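
A rough C sketch of that character-at-a-time reading follows. It assumes that both ' ' and '.' count as pause characters (which this thread has not settled), and the function name and signature are made up for illustration. It takes the text that follows the opening delimiter.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: both ' ' and '.' count as pause characters. */
static int is_pause_char(char c) { return c == ' ' || c == '.'; }

/* rest:  the text immediately after the opening delimiter.
   delim: the opening delimiter word.
   Returns 1 and stores the offset and length of the quoted text on
   success; returns 0 if the quote is ungrammatical under this reading. */
int zoi_strict_scan(const char *rest, const char *delim,
                    size_t *start, size_t *len) {
    size_t dlen = strlen(delim);
    size_t i = 0;
    size_t begin;

    if (dlen == 0) return 0;

    /* A pause is required after the opening delimiter. */
    while (is_pause_char(rest[i])) i++;
    if (i == 0) return 0;
    begin = i;

    for (;;) {
        /* Look for the delimiter here, after an optional pause. */
        size_t j = i;
        while (is_pause_char(rest[j])) j++;
        if (strncmp(rest + j, delim, dlen) == 0) {
            /* Found it: it only closes the quote if a pause precedes it.
               A bare occurrence means the delimiter is inside the text. */
            if (j == i) return 0;
            *start = begin;
            *len = i - begin;
            return 1;
        }
        if (rest[i] == '\0') return 0;  /* delimiter never recurs */
        i++;
    }
}
```

With the delimiter "gy", this rejects ". gyrations .gy." because "gy" turns up inside the text with no pause of its own, and accepts ". rotations .gy." with "rotations" as the quoted text.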
It's convoluted, but I think your statement that I'm disallowing
X even in places not preceded by a pause is indeed a CLL
requirement. Both by implication of the example given and because
of the text I specifically quoted above.
I'm not suggesting this behavior is what we want, but I do still
believe it is what the CLL describes.
I might not be able to forgive you, xorxes, for making me download
and read the source code to the official parser. Looking at it, I
a) think we can do better and b) think I better understand why the
CLL is confusingly worded.
In the technical description of the parser, the following statement
is made:
a. If the Lojban word "zoi" (selma'o ZOI) is identified, take the
following Lojban word (which should be end delimited with a pause for
separation from the following non-Lojban text) as an opening delimiter.
Treat all text following that delimiter, until that delimiter recurs
*after a pause*, as grammatically a single token (labelled 'anything_699'
in this grammar). There is no need for processing within this text
except as necessary to find the closing delimiter.
This seems pretty clear-cut to me, but it has almost nothing to do
with the implementation, which contradicts the opening example in
this thread in how it processes anything_699.
(BTW, I'm not clear as to whether a pause is both space and '.', or
whether it is only '.'. Help?)
The implementation is contained in filter.c, in particular the
following lines:
    case ZOI_START_MODE:
      tok = lex();
      if (isEnd(tok)) return tok;
      tok->type = any_word_698;
      mode = ZOI_STRING_MODE;
      delim = tok;
      return tok;
    case ZOI_STRING_MODE:
      result = newtoken();
      result->type = anything_699;
      for (;;) {
        tok = lex();
        if (isEnd(tok)) return tok;
        if (strcmp(tok->text, delim->text) == 0) break;
        tok->type = -1;
        add(result, tok);
      }
      mode = ZOI_END_MODE;
      return result;
    case ZOI_END_MODE:
      /* note: token has already been read */
      tok->type = any_word_698;
      mode = NORMAL_MODE;
      return tok;
If you follow lex(), you find getword(), which is the low-level
tokenizer in the parser. It reads ' ' or '.' delimited strings,
which means it considers "pano" a single token.
As a result, it behaves much like camxes does with "gyrations", but
I believe it would differ from camxes in parsing "gyrate", which
at this level of processing it would insist on treating as a single
token rather than three Lojban words.
In no case does it go looking for the delimiter inside individual
tokens, a behavior which camxes matches.
The code has the effect of treating everything between the delimiter
words as a single token, but misses edge cases because of the way
the tokenizer works.
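
To make the token-level behavior concrete, here is a small standalone C sketch. It is not the code from filter.c. It assumes only what is described above: tokens are maximal runs of characters delimited by ' ' or '.', and the closing delimiter is found by comparing whole tokens. The function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: tokens are separated by ' ' or '.', as described above. */
static int is_sep(char c) { return c == ' ' || c == '.'; }

/* rest:  the text immediately after the opening delimiter.
   delim: the opening delimiter word.
   Returns the number of tokens inside the quote, or -1 if the
   delimiter never recurs as a whole token. */
int zoi_count_quoted_tokens(const char *rest, const char *delim) {
    size_t dlen = strlen(delim);
    const char *p = rest;
    int count = 0;

    for (;;) {
        const char *tok;
        size_t tlen;

        while (is_sep(*p)) p++;          /* skip separators */
        if (*p == '\0') return -1;       /* ran out of input */

        tok = p;                         /* read one whole token */
        while (*p != '\0' && !is_sep(*p)) p++;
        tlen = (size_t)(p - tok);

        /* Whole-token equality only; never looks inside a token. */
        if (tlen == dlen && strncmp(tok, delim, dlen) == 0) return count;
        count++;
    }
}
```

Under this sketch both ". gyrations .gy." and " gyrate gy" yield a quote containing one token, since neither "gyrations" nor "gyrate" is ever compared piecewise against "gy".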
-Alan
I understand what you're saying, I simply disagree that zoi should
work that way; the ending particle should have to be completely
delimited.
I don't actually *care* all that much, it's just my feeling on the
matter.
From the conversation, I can summarize three separate proposals:
1) leave the PEG grammar alone and correct the CLL to describe the
way this grammar is behaving.
2) replace the rule for zoi-word to match non-lojban-word rather
than any-word, so 'gyrate' won't be divided into three words,
satisfying Robin's consistency argument. We'll still need to
update the CLL for the behavior, I think.
3) Replace the PEG grammar with something that reads the stuff
between the ZOI delimiter a character at a time. Either require
a pause before the final delimiter or not.
I like 2, 1, and then 3 in that order.
3 I don't like because the rest of the grammar doesn't really work
that way. The grammar is defined by token streams and the
morphology file handles composing the input stream into tokens.
3 is not completely unprecedented, however; FAhO does something
similar.
2 I like because of the argument Robin raised about
consistency. I think it is surprising that gyrate is invalid
but gyration is valid, and I don't think that surprise is a useful
feature. I can't think of a reason to dislike 2, please help me
with that.
1 Has the advantage of working that way right now. It also has the
advantage of preferring Lojbanic text to non-Lojbanic text. ZOI
isn't special from that perspective; the parser does what the parser
does, ZOI or not.
Will you give me your preference ordering? I'd like to know whether
I should:
a) update the CLL errata.
b) update the PEG grammar.
c) raise an issue with the BPFK to make a decision.
I'll use the preference ordering to make a choice.
This is currently my preference.
> 2) replace the rule for zoi-word to match non-lojban-word rather
> than any-word, so 'gyrate' won't be divided into three words,
> satisfying Robin's consistency argument. We'll still need to
> update the CLL for the behavior, I think.
Notice that "gyrate" does NOT match "non-lojban-word" (and neither
does "gyrations" for that matter). You probably mean something else,
some new rule like "string-of-phonemes-without-space <- non-space+".
This also relates to the ZOhOI proposal.
Yes, CLL needs updating in any case. Robin requires the final
delimiter to be followed by a pause, which CLL does not.
> 3) Replace the PEG grammar with something that reads the stuff
> between the ZOI delimiter a character at a time. Either require
> a pause before the final delimiter or not.
>
> I like 2, 1, and then 3 in that order.
1, 2, 3 for me.
> 2 I like because because of the argument Robin raised about
> consistency. I think it is surprising that gyrate is invalid
> but gyration is valid, and I don't think that surprise is a useful
> feature. I can't think of a reason to dislike 2, please help me
> with that.
But wouldn't "zoi mi miklama mi" as a valid quote be surprising? Are
you equally tempted to read "miklama" as a single word as you are to
read "gyrate" as one word, or are you just thinking English?
Excellent point, I do see miklama as two Lojban words.
Ah, I'm proposing to remove the !lojban-word from the
non-lojban-word production, as it is redundant -- non-lojban-word is
used in an ordered choice operator for which the preceding rule is
lojban-word, which obviates the need for it in that case. That is
the only reference.
In that case, I suggest changing the name of the rule to something
more descriptive. I find it confusing to think of "mi", "klama" and
"miklama" as "non-lojban-word"s.
I've added a BPFK section asking for resolution on this issue:
http://www.lojban.org/tiki/tiki-index.php?page=BPFK+Section%3A+ZOI
-Alan