Etymology of future gismu (if they are to be created)

gleki

May 12, 2012, 9:25:33 AM

to loj...@googlegroups.com

In recent discussion in "bugs in jbovlaste" Robin suggested that thinking of a bunch of new gismu is worth considering.
If this side wins what will be the source of the sounding of those gismu?
Don't you think that 6 languages are too few?
Let's take English and Spanish. There are many more speakers of those two than of
Portuguese, French and Italian.
But if we combine the last three with the previous two they might considerably shift the ratings of phonemes and their combinations
when applying the gismu creation algorithm (http://dag.github.com/cll/4/14/)
because many words sound similar in those languages.

So why not expand the algorithm to more than 6 languages, maybe even up to 20-30 languages? These days (as opposed to 1987) it's not a problem to find the sounding of the selected gismu concept for every language.

Robin Lee Powell

May 12, 2012, 1:35:20 PM

to loj...@googlegroups.com

On Sat, May 12, 2012 at 06:25:33AM -0700, gleki wrote:
> So why not expand the algorithm to more than 6 languages and may
> be even up to 20-30 languages as these days (as opposed to 1987)
> it's not a problem to find the sounding of the selected gismu
> concept for every language.

It's not? It sure sounds pretty problematic to me. I don't read
Devanagari or Hanzi, for starters.

-Robin

--
http://singinst.org/ : Our last, best hope for a fantastic future.
.i ko na cpedu lo nu stidi vau loi jbopre .i danfu lu na go'i li'u .e
lu go'i li'u .i ji'a go'i lu na'e go'i li'u .e lu go'i na'i li'u .e
lu no'e go'i li'u .e lu to'e go'i li'u .e lu lo mamta be do cu sofybakni li'u

Jonathan Jones

May 12, 2012, 3:57:05 PM

to loj...@googlegroups.com

On Sat, May 12, 2012 at 7:25 AM, gleki <gleki.is...@gmail.com> wrote:

In recent discussion in "bugs in jbovlaste" Robin suggested that thinking of a bunch of new gismu is worth considering.

If this side wins <snip>

There are no sides, and no winning. There will be no creation, destruction, or alteration of gismu until the baseline has been fully documented. After that, it is possible and even likely that all three will occur.

--
mu'o mi'e .aionys.

.i.e'ucai ko cmima lo pilno be denpa bu .i doi.luk. mi patfu do zo'o
(Come to the Dot Side! Luke, I am your father. :D )

gleki

May 13, 2012, 1:02:04 AM

to loj...@googlegroups.com

> It's not? It sure sounds pretty problematic to me. I don't read
> Devanagari or Hanzi, for starters.

It's not. Google Translate, other romanization services, and audio samples such as those on Forvo do their job pretty well, so we can always determine the closest Lojban phoneme to each sound in Hindi and Mandarin.

Jonathan Jones

May 13, 2012, 1:17:23 AM

to loj...@googlegroups.com

I know from experience that any and all translation programs are horrid at translation.

Furthermore, I don't see any need to include more languages into the algorithm.

gleki

May 13, 2012, 2:01:08 AM

to loj...@googlegroups.com

>>I know from experience that any and all translation programs are horrid at translation.
>>Furthermore, I don't see any need to include more languages into the algorithm.

Transliteration *may be* horrid indeed (especially in case of Arabic). However, audio recordings can solve this issue.
The algorithm was chosen to make people from all over the world learn words quicker.
If so why limit the number of source languages to 6?
Russian is no longer among the first 6. Bengali or Indonesian has probably taken its place.
And do those 6 languages really represent the majority of the population of the planet?
The most trustworthy answer is the following.
If adding more languages changes the resulting sounding then 6 languages are not enough.

Sid

May 13, 2012, 7:15:37 AM

to loj...@googlegroups.com

Bengali is indeed up there, but it's still a fairly close call with Russian.

mi'e cntr

Pierre Abbat

May 13, 2012, 8:00:50 AM

to loj...@googlegroups.com

On Sunday 13 May 2012 02:01:08 gleki wrote:
> Transliteration *may be* horrid indeed (especially in case of Arabic).
> However, audio recordings can solve this issue. The algorithm was chosen to
> make people from all over the world learn words quicker. If so why limit
> the number of source languages to 6?

Back when they were making the gismu list, they tested various numbers of
source languages. When they tried more than 8, the algorithm behaved like a
random-word generator. 6 is the number they found the best for making words
that are recognizable to a large number of people.

Pierre
--
li ze te'a ci vu'u ci bi'e te'a mu du
li ci su'i ze te'a mu bi'e vu'u ci

gleki

May 13, 2012, 8:34:06 AM

to loj...@googlegroups.com

> Back when they were making the gismu list, they tested various numbers of
> source languages. When they tried more than 8, the algorithm behaved like a
> random-word generator. 6 is the number they found the best for making words
> that are recognizable to a large number of people.

This disproves JCB's theory (of global etymology that he mentioned in his article in SciAm).
However, this can be easily checked.

Make a note of this somewhere, in your notes or browser bookmarks.

*If it comes to creating new gismu, ask me (if I'm still around at that time).
I'll give you etymology, sources/dictionaries/audio samples and the scoring results for the concepts that you choose.*

John E. Clifford

May 13, 2012, 10:19:51 AM

to loj...@googlegroups.com

It is worth noting that this familiarity idea for gismu has never been tested. While there are anecdotal claims of how it helped in learning some words, there are about the same number of anecdotes about how it mistaught or slowed the learning of others. In any case, the advantage over random words seems slight and probably not enough to offset the advantages of more controlled word development (equitable spread, useful rafsi, etc.).

Robert LeChevalier

May 14, 2012, 5:03:39 AM

to loj...@googlegroups.com

gleki wrote:
>>>I know from experience that any and all translation programs are horrid at translation.
>>>Furthermore, I don't see any need to include more languages into the algorithm.
>
> Transliteration *may be* horrid indeed (especially in case of Arabic). However, audio recordings can solve this issue.
> The algorithm was chosen to make people from all over the world learn words quicker.

That wasn't quite the reason, though certainly JCB believed that it was
true. The primary reason was to create a lexicon that was (at least
apparently) NOT biased in favor of any one language to an extent that
exceeded its natural influence. "Cultural neutrality" was the
watchword. There were and are a lot of problems with how JCB formulated
the problem, and the dominance of American English semantics on the
MEANINGS of the words is what I most fear, but that is what we are stuck
with.

We did attempt to gather information using the old LogFlash program to
determine whether indeed recognition scores were predictive of word
learning. We got maybe a dozen data sets from different people, but my
lack of time and statistical analysis skills leaves the analysis of that
old data as one of my never-done tasks.

I suspect that there will be some correlation, but it might only exist
on those words with higher recognition scores. Since more languages
would lower the average score, learnability would likely be hurt.

> If so why limit the number of source languages to 6?

Because any more than 6 was counterproductive, leading to essentially
random words, and even then Arabic in 6th place had very little Lojbanic
significance (in part because of the nature of Arabic morphology). The
extreme population dominance of Chinese and English (including 2nd
language speakers), and the existence of short roots in those languages
means that most Lojban words are basically an amalgamation of those two
languages, with sometimes a little coloration of one of the other languages.

Remember that a word has to match at least 2 letters (and if only 2,
they must be in the right place) in order to contribute to a Lojban
recognition score.
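The matching rule can be sketched roughly as follows. This is a sketch
based on my reading of the CLL section linked in the first post, not the
original Turbo Pascal program: I am assuming that three or more letters
matching in order score their count, that exactly two matching letters
score 2 only when they are adjacent in both words or separated by one
letter in both words, and that each score is divided by the length of
the source word and multiplied by the language weight. The function
names and the sample strings are made up for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def has_placed_pair(candidate: str, source: str) -> bool:
    """True if two letters match in order and are adjacent in both words,
    or separated by exactly one letter in both words."""
    for gap in (1, 2):
        source_pairs = {(source[j], source[j + gap])
                        for j in range(len(source) - gap)}
        for i in range(len(candidate) - gap):
            if (candidate[i], candidate[i + gap]) in source_pairs:
                return True
    return False


def match_score(candidate: str, source: str) -> int:
    """Raw score of one candidate gismu against one source word."""
    common = lcs_length(candidate, source)
    if common >= 3:
        return common          # three or more letters in the same order
    if has_placed_pair(candidate, source):
        return 2               # exactly two letters, "in the right place"
    return 0                   # anything else contributes nothing


def total_score(candidate: str, sources: dict, weights: dict) -> float:
    """Weighted sum over languages; sources maps language -> source word."""
    return sum(weights[lang] * match_score(candidate, word) / len(word)
               for lang, word in sources.items())
```

With the made-up strings `"abcde"` and `"bxxxd"`, the letters b and d
match in order but are not placed alike, so that source word would
contribute nothing to the total, which is the situation described above.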

I suspect that any rigorous study would show that the Lojban morphology
cannot effectively represent contributions from more than 3 language
families (in essence, three languages with other languages possibly
reinforcing those three when their roots are similar, which happens most
often when they are in the same language family, or when there has been
significant borrowing). Most often, only two languages/families are
represented.

A couple percentage points different, and Lojban would look like an
amalgamation of Chinese and Hindi. Indeed, per the numbers below, that
is what would probably happen now.

We did experiments with more languages, ranging up to 12, but additional
languages merely gave lower recognition scores (sometimes leading to tie
scores between entirely different strings), and rarely, a letter might
change because it gave a couple more points.
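The lowering of scores can be illustrated with simple arithmetic. The
sketch below assumes that a new language's weight is carved out of the
existing weights proportionally; the thread does not say how the weights
were rebalanced in the experiments, so that rule, the language labels
and the numbers are all hypothetical.

```python
def add_language(weights: dict, new_lang: str, new_weight: float) -> dict:
    """Scale the existing weights down proportionally to make room for
    new_lang (assumed rebalancing rule, for illustration only)."""
    scale = 1.0 - new_weight
    rescaled = {lang: w * scale for lang, w in weights.items()}
    rescaled[new_lang] = new_weight
    return rescaled


def weighted_total(weights: dict, per_language_scores: dict) -> float:
    """Sum of weight * score; a language with no match contributes 0."""
    return sum(weights[lang] * per_language_scores.get(lang, 0.0)
               for lang in weights)


# Hypothetical two-language setup; scores are already divided by word length.
weights = {"A": 0.6, "B": 0.4}
scores = {"A": 0.5, "B": 0.25}

before = weighted_total(weights, scores)
# Language "C" takes 20% of the weight but its word matches nothing.
after = weighted_total(add_language(weights, "C", 0.2), scores)
```

Under these assumptions every added language whose word fails to match
shrinks the total by the share of weight it takes, without changing
which candidate string wins.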

If I had it to do over again, I would make a couple of changes in Chinese
transliteration (which would give us more "o" and less "a" in the
language), and perhaps try to find a way to decrease the reinforcing of
fricative sounds that aren't really alike in Chinese. And I would use
entirely different rules for Arabic, because vowels count so little in
their roots compared to consonants, but the Lojban algorithm weights
consonants and vowels more or less equally.

At one point in the 90s, I fiddled with the program to try to do this,
but the original program no longer works properly (parts had been coded
in assembler to speed up the innermost loops back in the 8086 era when a
single word run would take several minutes rather than a few seconds)
and I was a little too rusty on my coding skills.

> Russian is no longer among first 6.

Actually, I think it still is, though I haven't done the calculations in
recent years. The last time I did so, in 2004, it had dropped from 5th
into 6th place, but it was still solidly ahead of Bengali because of
second language speakers; it is probably closer now because Bengali
continues to grow, while Russian is stagnant or waning; both are
probably in the neighborhood of 250 million total speakers. But Russian
is no more influential in the wordmaking than Arabic is, though in its
case that is primarily because Russian roots are quite long. Bengali would
likely have a little more influence, but only to the extent that its
roots reinforce Hindi roots, skewing the language more towards the
Chinese/Hindi amalgam mentioned above.

Next after Bengali is Portuguese, because Indonesian is still primarily
a second language for most people who speak it, and second language
speakers are halved.
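The halving rule can be written out as a small calculation. This is only
a sketch: I am assuming the weight is each language's effective speaker
count divided by the total over the chosen languages, and the language
labels and counts below are placeholders, not real figures.

```python
def effective_speakers(first_language: float, second_language: float) -> float:
    """Second-language speakers count at half value."""
    return first_language + 0.5 * second_language


def weights_from_speakers(counts: dict) -> dict:
    """counts maps language -> (first-language, second-language) speakers.
    Returns weights that sum to 1 (assumed normalization)."""
    effective = {lang: effective_speakers(first, second)
                 for lang, (first, second) in counts.items()}
    total = sum(effective.values())
    return {lang: value / total for lang, value in effective.items()}


# Placeholder numbers in millions: "B" has fewer native speakers than "A"
# but many second-language speakers.
example = weights_from_speakers({"A": (100, 0), "B": (40, 120)})
```

In this placeholder case the two languages end up with equal weight,
because 120 second-language speakers count as 60.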

The 2004 weighting would have been
Chinese .33
Hindi .21
English .18
Spanish .12
Arabic .09
Russian .07

The 1987 weights were
Chinese .36
Hindi .16
English .21
Spanish .12
Arabic .07
Russian .09
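The two weight tables can be compared directly. The numbers below are
copied from the lists above; the helper names are made up for this
sketch.

```python
WEIGHTS_1987 = {"Chinese": 0.36, "Hindi": 0.16, "English": 0.21,
                "Spanish": 0.12, "Arabic": 0.07, "Russian": 0.09}
WEIGHTS_2004 = {"Chinese": 0.33, "Hindi": 0.21, "English": 0.18,
                "Spanish": 0.12, "Arabic": 0.09, "Russian": 0.07}


def ranking(weights: dict) -> list:
    """Languages ordered from heaviest to lightest weight."""
    return sorted(weights, key=weights.get, reverse=True)


def rank_of(lang: str, weights: dict) -> int:
    """1-based position of lang in the ranking."""
    return ranking(weights).index(lang) + 1


def top_two_share(weights: dict) -> float:
    """Combined weight of the two heaviest languages, rounded to 2 places."""
    first, second = ranking(weights)[:2]
    return round(weights[first] + weights[second], 2)
```

By these tables Russian moves from 5th to 6th place and Hindi overtakes
English for 2nd, so the top pair changes from Chinese/English (.57
combined) to Chinese/Hindi (.54 combined). The 1987 figures as listed
add up to 1.01, presumably because of rounding.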

If Bengali replaced Russian or were added, this would slightly
strengthen Hindi. But its weight would be on the same order as Russian,
not enough to actually participate in word-making except where it
reinforces the weight of a Hindi root. Even Spanish has insufficient
weight to participate in many words, except when it reinforces an
English root.

Portuguese would probably significantly reinforce Spanish, perhaps
enough to enable it to match English in weight, but otherwise would
never make any contribution.

Indonesian wouldn't reinforce anything except where it uses a borrowed
word, and thus would have even less effect than Arabic.

> And do those 6 languages really represent the majority of the population of the planet?

Actually yes, but not by much. In 2004, the 6 languages represented 2.7
billion first language speakers and 1.5 billion 2nd language speakers
(with some overlap, especially in Hindi/English speakers, but probably
not so much as to not exceed half of the current 7 billion).

But that wasn't the intent.

> The most trustworthy answer is the following.
> If adding more languages changes the resulting sounding then 6 languages are not enough.

Redoing the words with the current Hindi weighting would have a big
change in the language. So would the change in Chinese transliteration.
Any Arabic change would probably help some, but not enough to
significantly change the sound of the language. Adding additional
languages would probably not change the words much (though there might
be some randomization effects), but would lower the recognition scores.

(Masochists who know old Turbo Pascal might be able to do something with
the program, including running some trials with different weightings.
The source is still floating around somewhere on my machine. But IIRC,
the code is poorly-enough documented so that a good programmer could
write something from scratch almost as fast, that would allow them to
try additional languages and see for themselves that it doesn't buy much.)

lojbab

gleki

May 14, 2012, 5:16:49 AM

to loj...@googlegroups.com

> Remember that a word has to match at least 2 letters (and if only 2,
> they must be in the right place) in order to contribute to a Lojban
> recognition score.

> Portuguese would probably significantly reinforce Spanish, perhaps
> enough to enable it to match English in weight, but otherwise would
> never make any contribution.

> Indonesian wouldn't reinforce anything except where it uses a borrowed
> word, and thus would have even less effect than Arabic.

If Portuguese (and maybe other Romance languages) can reinforce Spanish,
and Bengali can do that with Hindi, then what's the problem?
And a lower recognition score is just the output of the script and nothing more, just a number. But who knows, maybe experimental gismu with lower recognition scores can be easier to learn than gismu based on fewer source languages?

Jonathan Jones

May 14, 2012, 5:27:49 AM

to loj...@googlegroups.com

Lower recognition means harder to learn, not easier. And you apparently missed the point, which is that any languages past the first two contribute negligibly, except in cases where their word is similar to one of those first two.

gleki

May 14, 2012, 7:15:28 AM

to loj...@googlegroups.com

je'e

ki'esai

anyway it would be nice to prove this theory in practice.

and I'm ready to do that if it comes to choosing new gismu.
