Spaces in jbovlaste


suke...@gmail.com

Jul 27, 2017, 4:49:37 AM
to lojban
coi ro do

I found entries with spaces in jbovlaste. This is a problem when building spell-checking dictionaries (in my case for "aspell"). I know that spaces are not relevant when parsing Lojban, but they are still important for human readers. This is why I would rather avoid a rule like "import every entry and remove spaces everywhere"...

So, I understand that it may be normal for compound cmavo, like "tai da'i", but can't these be written without a space ("taida'i") without breaking the reading flow?
However, some entries seem very strange to me, such as "re zei zgabube". Aren't these three separate words?

Thank you for your explanations.

co'o

-- 
Sukender

Ilmen

Jul 27, 2017, 7:04:54 AM
to loj...@googlegroups.com
If spell checkers are only concerned with identifying what is a correct
word and what isn't, then you should disregard Jbovlaste entries
containing whitespace (they are multi-word lexemes), or even better,
check all the words that compose them to see if any of them is missing
from your spell-check whitelist (I strongly suspect there exist bu and
zei compounds containing words that appear nowhere else in the
dictionary…).
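
A minimal Python sketch of that check, for illustration only. The function name, the `entries` list and the `whitelist` set are hypothetical, and the sample data is made up rather than taken from a jbovlaste export:

```python
def missing_component_words(entries, whitelist):
    """Return words that occur only inside multi-word entries.

    `entries` is an iterable of dictionary headwords (some with spaces);
    `whitelist` is the set of single words already accepted.
    Both names are hypothetical, chosen for this sketch.
    """
    missing = set()
    for entry in entries:
        parts = entry.split()
        if len(parts) < 2:
            continue  # single-word entries are already in the whitelist
        for word in parts:
            if word not in whitelist:
                missing.add(word)
    return missing


# Made-up sample: "zgabube" is assumed to appear nowhere on its own.
entries = ["tai da'i", "re zei zgabube", "klama"]
whitelist = {"tai", "da'i", "re", "zei", "klama"}
print(sorted(missing_component_words(entries, whitelist)))  # ['zgabube']
```

The words returned this way could then be added to the whitelist before the multi-word entries themselves are dropped.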

"re zei zgabube" is indeed a sequence of three words. It is present in
the dictionary because it is an independent lexeme: you cannot
accurately derive its meaning from its parts. This occurs all the time
in natlangs; think, for example, of the English "take off".

As for cmavo sequences, people are allowed to chain them up without
whitespace in between (this causes no ambiguity), although nowadays it
seems more common to always separate them with whitespace. For a
spell-checker, two strategies are possible: the lazy one would be to
enforce the style of putting whitespace between every cmavo, thus
marking e.g. "lonu" as incorrect; the second strategy, more involved,
would be to check any unknown letter string to see if it matches a
sequence of cmavo, and allow it if it does (e.g. if the program hits
"calonu" and is able to find it can be a sequence of cmavo ca+lo+nu,
only then would it allow it). But I don't know if the software you're
using is able to do that without an explicit and systematic list of all
allowable cmavo strings…
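
The second strategy could be sketched roughly as follows in Python. This is only an illustration: `is_cmavo_sequence` and `known_cmavo` are hypothetical names, the cmavo set is a tiny sample, and a real spell checker such as aspell or Hunspell would need its own mechanism for this.

```python
def is_cmavo_sequence(text, cmavo):
    """True if `text` can be cut into one or more words from `cmavo`."""
    # reachable[i] is True when text[:i] splits cleanly into known cmavo.
    reachable = [True] + [False] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            if reachable[start] and text[start:end] in cmavo:
                reachable[end] = True
                break
    return len(text) > 0 and reachable[len(text)]


known_cmavo = {"ca", "lo", "nu", "tai", "da'i"}  # tiny sample, not the real list
for candidate in ["calonu", "lonu", "taida'i", "calonx"]:
    print(candidate, is_cmavo_sequence(candidate, known_cmavo))
```

With this sample set, "calonu", "lonu" and "taida'i" are accepted and "calonx" is rejected. The check only needs the list of single cmavo, not a list of every allowable cmavo string.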

If the software were to need an explicit and exhaustive list of allowed
words, I guess it wouldn't be very handy to use for very synthetic
languages (e.g. Turkish, Quechua, Greenlandic…), which might have an
infinite number of valid words.

—Ilmen.

suke...@gmail.com

Jul 27, 2017, 9:43:44 AM
to lojban
On Thursday, July 27, 2017 at 13:04:54 UTC+2, Ilmen wrote:
> If spell checkers are only concerned with identifying what is a correct
> word and what isn't,

Exactly! For now, my first concern is to get a first step towards spell/grammar checking for common software (see the other thread). That's clearly a "better than nothing" idea... and yes, it's clearly sub-optimal.
 
> then you should disregard Jbovlaste entries
> containing whitespace (they are multi-word lexemes), or even better,
> check all the words that compose them to see if any of them is missing
> from your spell-check whitelist (I strongly suspect there exist bu and
> zei compounds containing words that appear nowhere else in the
> dictionary…).

Great! I'll do that. Thanks.
 
> "re zei zgabube" is indeed a sequence of three words. It is present in
> the dictionary because it is an independent lexeme: you cannot
> accurately derive its meaning from its parts. This occurs all the time
> in natlangs; think, for example, of the English "take off".

Okay. But as you mentioned, spell checkers only check spelling! So in the English ones, "take" and "off" are separated. The grammar checker, however, should detect the meaning of "take off" instead of "take" and "off" separately.
 
> As for cmavo sequences, people are allowed to chain them up without
> whitespace in between (this causes no ambiguity), although nowadays it
> seems more common to always separate them with whitespace. For a
> spell-checker, two strategies are possible: the lazy one would be to
> enforce the style of putting whitespace between every cmavo, thus
> marking e.g. "lonu" as incorrect; the second strategy, more involved,
> would be to check any unknown letter string to see if it matches a
> sequence of cmavo, and allow it if it does (e.g. if the program hits
> "calonu" and is able to find it can be a sequence of cmavo ca+lo+nu,
> only then would it allow it). But I don't know if the software you're
> using is able to do that without an explicit and systematic list of all
> allowable cmavo strings…

You're right. I guess I'll insert both the "split" and the "merged" forms of jbovlaste entries ("tai da'i" and "taida'i"). But as long as the reference doesn't list ALL possible combinations ("ca lo nu", "ca lonu", "calonu", etc.), and as long as there are no fine-grained rules for generating "affixes" (i.e. compound-word generation for spell checkers), it will be hard to be precise.

I'll start with a very basic spell checker and maybe add rules later on... if there are enough people willing to help! I'm clearly too inexperienced in Lojban to easily find which rules are the "most important". Can you think of a few rules that could be integrated?
I guess that the rule "a cmavo can follow a cmavo as a suffix" could be nice, but I don't know how to implement it. I'm currently struggling with https://www.systutorials.com/docs/linux/man/4-hunspell/#lbAI

> If the software were to need an explicit and exhaustive list of allowed
> words, I guess it wouldn't be very handy to use for very synthetic
> languages (e.g. Turkish, Quechua, Greenlandic…), which might have an
> infinite number of valid words.

Well, that's the "affix" stuff I just wrote about. I don't know anything about those languages, but surely they have "good" affix/replacement rules in their dictionaries.
 
Anyway, thank you very much for the clarification.

-- 
Sukender

suke...@gmail.com

Jul 28, 2017, 7:54:35 AM
to lojban
coi la .ilmen.

I just applied your idea (added split entries) and added merged entries... And I also found a very simple way to add compound cmavo!
Indeed:
  • I created a script that splits jbovlaste entries into cmavo and non-cmavo, by using a simple regex (using rules listed in the CLL, chapter 4.2)
  • Then I tagged all cmavo with a flag "C", and added the Hunspell rule "CCC*" (~= "CC+"), which means you can "glue" 2 or more cmavo together.
Of course, this will allow ungrammatical things such as "lonulonucalo", but once again, catching those is not the spell checker's role.
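
As an illustration of the tagging step, here is a rough Python sketch that writes Hunspell-style `.dic` lines (a word count first, then one `word/FLAGS` entry per line) and tags simple cmavo with the flag "C". The regex is a simplified assumption, not the full rule set of the CLL and not the one used in the actual script; `to_dic_lines` is a hypothetical name, and the matching compound rule in the `.aff` file is not shown.

```python
import re

# Simplified cmavo shape: optional dot, optional single consonant, then vowels
# optionally separated by apostrophes. An assumption, not the CLL's full rule.
CMAVO_RE = re.compile(r"\.?[bcdfgjklmnprstvxz]?[aeiouy](?:'?[aeiouy])*")


def to_dic_lines(words, flag="C"):
    """Build Hunspell .dic lines, tagging simple cmavo with `flag`."""
    unique = sorted(set(words))
    lines = [str(len(unique))]  # .dic files start with a word count
    for word in unique:
        if CMAVO_RE.fullmatch(word):
            lines.append(f"{word}/{flag}")  # may take part in compounds
        else:
            lines.append(word)  # plain entry, no compounding flag
    return lines


print("\n".join(to_dic_lines(["ca", "lo", "nu", "lonu", "klama", "da'i"])))
```

In this sample, "ca", "da'i", "lo" and "nu" receive the flag, while "lonu" and "klama" are written as plain entries.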

I tried your example "calonu". It seems the "lonu" entry exists, so my dictionary interprets it as a "normal word" (= non-simple-cmavo) instead of a "compound cmavo". But all the following combinations are now valid:
  • ca, lo, nu
  • lo nu, lonu, ca lo, calo
  • ca lonu, calo nu, calonu
Only calo & calonu are detected as a compound (remember "lonu" is an entry), but anyway that works as expected.
Experimental cmavo support will be added soon.

Do you know of other rules that would be worth integrating?
Please test ( https://github.com/Sukender/lojban-spell-check-dist ) and give feedback! ki'e

I still have issues with dots in LibreOffice (.i .a and such)... And some words of "le cmalu noltru" are not recognized yet. Is there any other word source I can use?

co'o

-- 
Sukender

Adam Lopresto

Jul 28, 2017, 11:33:26 AM
to lojban
If you're going to allow cmavo to be combined arbitrarily (which is probably appropriate), then there's no reason for {lonu} to have its own entry. So I'd suggest not adding any cmavo clusters.

And {lonulonucalo} can be grammatical; you just need the right text after it. {lonulonucalo nu jamna kei mi damba cu nandu mi cu se zungi mi}, "I feel guilty that it was hard for me to fight during the war." As you said, a full grammar checker would be needed to really get things right, and that's a separate problem.


Sukender

Jul 28, 2017, 11:43:47 AM
to loj...@googlegroups.com
Thanks for the clarification. I didn't even imagine that this big random compound cmavo could be valid! You made my evening! ;-)

About the {lonu} entry, I clearly agree. But I can't filter them all out... Or can I? If you have any (simple) idea for a rule that does that, go ahead!

By the way, I already filtered out a few words. I found some very long ones (even a weird one about the Macarena!). As they may be spam, I added an arbitrary rule that throws away all entries longer than 22 characters. Maybe a finer rule has to be found...

Cheers,

--
Sukender

Adam Lopresto

Jul 28, 2017, 11:51:56 AM
to loj...@googlegroups.com
jbovlaste should already be filtered to contain only Lojban, and there are, broadly, three types of Lojban words:
  • cmevla: everything that ends in a consonant
  • brivla: all contain a consonant cluster and end in a vowel
  • cmavo: optionally start with a single consonant, and consist entirely of vowels and apostrophes after that

So, I think you could filter all cmavo clusters by looking for anything that matches /.+[^aeiou'].*[aeiou]/ but doesn't match /[^aeiou'][^aeiou']/. That is, anything that contains a non-vowel somewhere after the first letter, ends in a vowel, and doesn't contain a consonant cluster.

At least, that seems like a good start. 
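
For illustration, the two patterns could be applied in Python roughly like this. The function name is hypothetical, and anchoring the first pattern to the whole word with `fullmatch` is an assumption, since the patterns above are written without anchors. Entries containing spaces would need to be split first, because a space also counts as a non-vowel here, and "y" is not treated as a vowel by these character classes.

```python
import re

# The two patterns from the message above, used with fullmatch/search.
LOOKS_COMPOUND = re.compile(r".+[^aeiou'].*[aeiou]")
CONSONANT_PAIR = re.compile(r"[^aeiou'][^aeiou']")


def is_cmavo_cluster(word):
    """True if `word` looks like several cmavo written without spaces."""
    return (LOOKS_COMPOUND.fullmatch(word) is not None
            and CONSONANT_PAIR.search(word) is None)


for word in ["lonu", "calonu", "taida'i", "lo", "da'i", "klama", "djan"]:
    print(word, is_cmavo_cluster(word))
```

On this sample, "lonu", "calonu" and "taida'i" are flagged as clusters; the single cmavo "lo" and "da'i", the brivla "klama" and the cmevla "djan" are not.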

suke...@gmail.com

Jul 31, 2017, 6:34:32 AM
to lojban
coi la .adam. .i coi ro do

Sorry for the late answer; I was tweaking my scripts & tools according to your advice and according to the CLL.
I did not use the exact regex you proposed, but I did include your idea. So: thanks! Could you possibly review/check my regexes (see links to scripts below)?

For your information, based on your idea and Ilmen's, I added a three-step process:
  1. Clean the input in a generic way: tabs/spaces, split entries with spaces, etc. (see the sed script at this point, or its latest version)
  2. Clean from a "Lojbanic" point of view: remove non-Lojban entries, prepend a dot to words starting with a vowel, etc. (current script / latest version)
  3. Split entries: cmevla, cmavo, compound cmavo, and a few other classes (current script / latest version)
Current results are:
38 "illegal" words, and 430 duplicates (mainly generated by splitting, such as when processing "lo nu", "lo", "nu")

Splitter generates such things (here are a few lines for each, of course):
--- cmavo ---
.a
.a'a
.a'au
.a'e
.a'ei
.ai
.a'i
--- cmavo_compound ---
.a'acu'i
.a'anai
.a'enai
.a'icu'i
.aicu'i
.ainai
.a'inai
--- brivla ---
.a'anmo
.abniena
.abvele
.aclotlu
.adgalagda
.adji
.admine
--- vowel ---
.abu
.ebu
.ibu
--- consonant ---
by
cy
dy
--- cmevla ---
.abata'adj
.abgad
.acaman
.akev
.akrobat
.akuuas
.aleksandras
--- other ---
(empty list)
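
The three steps can be approximated in a few lines of Python. This is a simplified sketch, not a port of the sed scripts: all names and regexes are assumptions, the sample input is made up, and only four of the classes listed above are distinguished (the "vowel" and "consonant" letter-word classes are left out).

```python
import re
from collections import Counter

CONS = "bcdfgjklmnprstvxz"
ALPHABET_RE = re.compile(rf"[{CONS}aeiouy'.,]+")  # crude "is this Lojban?" test
CLUSTER_RE = re.compile(rf"[{CONS}]{{2}}")        # two consonants in a row
SIMPLE_CMAVO_RE = re.compile(rf"\.?[{CONS}]?[aeiouy](?:'?[aeiouy])*")


def clean(raw_entries):
    """Steps 1 and 2: normalise whitespace, split, filter, add leading dots."""
    words, illegal = [], []
    for entry in raw_entries:
        for word in entry.split():  # split() also collapses tabs and spaces
            if not ALPHABET_RE.fullmatch(word):
                illegal.append(word)
                continue
            if word[0] in "aeiouy":
                word = "." + word  # vowel-initial words get a leading dot
            words.append(word)
    return words, illegal


def classify(word):
    """Step 3: coarse word class, following the rules quoted in this thread."""
    if word[-1] in CONS:
        return "cmevla"
    if CLUSTER_RE.search(word):
        return "brivla"
    if SIMPLE_CMAVO_RE.fullmatch(word):
        return "cmavo"
    return "cmavo_compound"


raw = ["lo nu", "lo", "nu", "a'anai", "adji", "akrobat", "tai\tda'i", "Qwerty"]
words, illegal = clean(raw)
counts = Counter(words)
duplicates = sum(n - 1 for n in counts.values())  # repeats created by splitting
classes = {word: classify(word) for word in counts}
print(classes, illegal, duplicates)
```

On this sample, splitting "lo nu" produces the two duplicates ("lo" and "nu"), "Qwerty" is reported as illegal, and ".a'anai", ".adji" and ".akrobat" land in the same classes as in the listing above.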



co'o

-- 
Sukender