Request for a full frequency list of all lojbanic words for an Android app.


la gleki

Apr 12, 2013, 10:57:17 AM
to loj...@googlegroups.com
peeps, i need ur help.
we are gonna have a Swype/Swipe feature for the MultiLing Android keyboard. I need a list of all lojbanic words + the frequency of each.
i know of a gismu frequency list. But it seems that not all gismu are there (fewer than 1342 are listed). What about cmavo, fu'ivla?

Of course, the rarest words can be given the lowest rating, but what are the most frequent words?
Can we rerun the algorithm to count all the occurrences of all words?

Robin Lee Powell

Apr 15, 2013, 3:51:25 PM
to loj...@googlegroups.com
http://users.digitalkingdom.org/~rlpowell/hobbies/lojban/flashcards/?C=M;O=D
-- the _freq lists should have everything.

It should be pretty easy to regenerate this stuff with the latest
from http://corpus.lojban.org/ , but I am (as usual) not
volunteering.

-Robin

la gleki

Apr 16, 2013, 3:36:01 AM
to loj...@googlegroups.com
Is there a script that can generate such lists?
 


Ross Ogilvie

Apr 16, 2013, 4:41:14 AM
to loj...@googlegroups.com
A quick search found this little gem for word frequencies:

tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2

Running it against the corpus gives the attached frequencies. However, don't use this freq list as is: it includes many English words, abbreviations and authors' names. Ideally one would clean the corpus of non-Lojban text and then run this script on it.
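
The one-liner above, unpacked stage by stage. This is only a sketch of the same pipeline wrapped in a function; the name word_freq is made up here and is not part of any existing script.

```shell
#!/bin/sh
# word_freq (hypothetical name): read text on stdin and print "count word"
# lines, most frequent word first.
word_freq() {
    # 1. Replace every run of characters that are not letters or apostrophes
    #    with a single newline, so each word ends up on its own line.
    tr -cs "A-Za-z'" '\n' |
    # 2. Fold everything to lower case so "Coi" and "coi" count as one word.
    tr 'A-Z' 'a-z' |
    # 3. Sort so identical words are adjacent, then count each run.
    sort | uniq -c |
    # 4. Order by count (numeric, descending); ties are broken alphabetically.
    sort -k1,1nr -k2
}

# Usage, assuming the corpus was saved as corpus.txt:
#   word_freq < corpus.txt > freq.txt
```

The apostrophe is kept in the first step because it is a letter in Lojban words such as fu'ivla.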

-- Ross

 


freq.bz2

Robin Lee Powell

Apr 16, 2013, 5:09:20 AM
to loj...@googlegroups.com
The scripts I used are in that same directory; not sure what's what
at this point, though.

-Robin

Ross Ogilvie

Apr 16, 2013, 5:34:16 AM
to loj...@googlegroups.com
Okay, I filtered my previous frequency list of Lojban words, removing all cmene and non-Lojban words, then manually picked out some authors' names that are brivla.

Please find attached.

-- Ross



filtered_freq.txt

la gleki

Apr 16, 2013, 6:02:11 AM
to loj...@googlegroups.com


On Tuesday, April 16, 2013 1:34:16 PM UTC+4, Ross Ogilvie wrote:
> Okay, I filtered my previous frequency list of lojban words, removing all cmene and non lojban words, then manually picked out some author's names that are brivla.


What do you mean by corpus? The irc log saves only parsable sentences, but i can still see many English words. What is the source of this corpus?

Also i think that we can trim the list to only the first 5000 words/clusters. The rest can be added manually from jbovlaste.

Ross Ogilvie

Apr 16, 2013, 6:27:06 AM
to loj...@googlegroups.com
By corpus, I mean the collection of texts found at http://corpus.lojban.org/ . At the top of the page there is a link to download them all in one text file. I took that document, ran a word frequency sorter on it, then filtered out all the non-Lojban words using cmafi'e (available in this Arch package: https://aur.archlinux.org/packages/jbofihe-git/ , thank you zorun).

I had a quick look and only spotted one English word in the first 1000: kinda. There are also some nonsense words like tene. Getting rid of these sorts of things would be much more time consuming. But if you want me to try something specific, let me know.

btw, the two scripts needed are

#!/bin/bash
### word_count

tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2

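#!/bin/zsh
##### filter_lojban
# Print only the "count word" lines whose word cmafihe accepts
# and does not classify as CMENE (a name).
while read -r i
do
    # Strip the leading count and hand the bare word to cmafihe.
    j=$(echo "$i" | sed 's/[0-9]* //' | cmafihe 2> /dev/null | grep -v -e "CMENE")
    if [[ -n $j ]]; then
        echo "$i"
    fi
done < "$1"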

And then run them like this, assuming corpus.txt is your source
word_count < corpus.txt > freq.txt
filter_lojban freq.txt > filtered_freq.txt

-- Ross

la gleki

Apr 16, 2013, 6:57:54 AM
to loj...@googlegroups.com


On Tuesday, April 16, 2013 2:27:06 PM UTC+4, Ross Ogilvie wrote:
> By corpus, I mean the collection of texts found here http://corpus.lojban.org/ At the top of the page there is a link to download them all in one text file. I took that document, ran a word frequency sorter on it, then filtered out all the non-lojban words using cmafi'e (available in this arch package https://aur.archlinux.org/packages/jbofihe-git/ , thank you zorun).
>
> I had a quick look and only spotted one english word in the first 1000: kinda.

"eimi" is among the first 30 words. It's a name. {kinda} can also be a joke gismu, though.

> And there are some nonsense words like tene. Getting rid of these sorts of things would be much more time consuming. But if you want me to try something specific, let me know.

I'll run another script on the irc log.

Pierre Abbat

Apr 16, 2013, 8:35:58 AM
to loj...@googlegroups.com
On Tuesday, April 16, 2013 19:34:16 Ross Ogilvie wrote:
> Okay, I filtered my previous frequency list of lojban words, removing all
> cmene and non lojban words, then manually picked out some author's names
> that are brivla.

9003 eimi (plausible beginning of a sentence, but more likely someone's name)
7841 teryrei
6796 tene (name, not a valid cmavo cluster)
3028 durka
2260 like
1892 gejyspa
1718 clyde
1437 sure
1426 tengo
1161 pafcribe
1154 nope
1121 sonja
1065 komfo
944 side
933 one
916 cirzgamanti
895 gunkamanti
849 mrtanooki
723 use
710 niekie
682 latro'a
589 very
508 azetidine
504 freenode
503 some
469 because
492 time

--
I believe in Yellow when I'm in Sweden and in Black when I'm in Wales.

la gleki

Apr 17, 2013, 12:49:08 PM
to loj...@googlegroups.com
ki'e .piER. 
I analysed the list up to line 4070, where i stopped because the words were getting so boring... anyway, by that point they had an absolute frequency of no more than 11.

So i trimmed that list, multiplied all frequencies by 10 and added all the other words from jbovlaste (those that were not already among the 4070).
Those words from jbovlaste were given the a priori frequency "1".
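
The trim, scale and merge steps described above could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used: the function name combine_lists and the file names are hypothetical, the frequency list is taken to have "count word" lines sorted by descending count, and the jbovlaste export is taken to be one word per line.

```shell
#!/bin/sh
# combine_lists N FREQ EXTRA (hypothetical helper):
#   keep the first N lines of FREQ with every count multiplied by 10,
#   then append each word of EXTRA that was not kept, with count 1.
combine_lists() {
    head -n "$1" "$2" |
    awk -v extra="$3" '
        # Corpus words: scale the count and remember the word.
        { print $1 * 10, $2; seen[$2] = 1 }
        END {
            # Dictionary-only words get the a priori count 1.
            while ((getline w < extra) > 0)
                if (w != "" && !(w in seen)) { print 1, w; seen[w] = 1 }
        }'
}

# Usage, assuming jbovlaste_words.txt holds one word per line:
#   combine_lists 4070 filtered_freq.txt jbovlaste_words.txt > MyFreq-COMB.txt
```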

I also changed the frequencies of the words "tsani, gleki, selpa'i" cuz i don't know what the REAL frequencies of these non-cmevla are.

I think the resulting list (attached to this message) is much more appropriate than the first one.

Please check it. Suggestions are more than welcome.
MyFreq-COMB.txt

la gleki

Apr 20, 2013, 9:44:09 AM
to loj...@googlegroups.com
Now I'm attaching two lists.

1. "MyFreq-COMB without dots" is a normal general-purpose list that has the dots removed from words starting with vowels. You can use it for flashcard memorising apps or similar.
2. "MyFreq-COMB with and without dots" is for apps like MultiLing that need to know that such words can be typed both with and without dots. So e.g. it has both {.i} and {i} with the same frequency.
MyFreq-COMB without dots.txt
MyFreq-COMB with and without dots.txt
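
The second list could be derived from the first along these lines. A minimal sketch, assuming the dotless list has "count word" lines and treating a, e, i, o, u and y as the vowels; the function name add_dotted is hypothetical and this is not necessarily how the attached files were made.

```shell
#!/bin/sh
# add_dotted FREQ (hypothetical helper): print every "count word" line and,
# for words starting with a vowel, a second line with the dotted spelling
# and the same count.
add_dotted() {
    awk '
        { print $1, $2 }                          # the word as listed
        $2 ~ /^[aeiouy]/ { print $1, "." $2 }     # dotted twin, same count
    ' "$1"
}

# Usage:
#   add_dotted "MyFreq-COMB without dots.txt" > "MyFreq-COMB with and without dots.txt"
```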

iesk

Apr 21, 2013, 10:06:46 AM
to loj...@googlegroups.com
Gleki, why do you remove proper names from the frequency lists, by the way? If they are often-typed words, would it not be useful to have them, as well, suggested by a typing application? (Sorry, I have only vague ideas of what that software actually does. It is about suggesting words during typing, isn’t it?)

-iesk

la gleki

Apr 21, 2013, 11:44:48 AM
to loj...@googlegroups.com


On Sunday, April 21, 2013 6:06:46 PM UTC+4, iesk wrote:
> Gleki, why do you remove proper names from the frequency lists, by the way? If they are often-typed words, would it not be useful to have them, as well, suggested by a typing application? (Sorry, I have only vague ideas of what that software actually does. It is about suggesting words during typing, isn’t it?)

yes, but it'd be extremely strange to have names there. Their frequency might change over time. Anyway, one can easily add new words while using the app.
So please add the names of your close lojbani friends yourself. This app is not only for those who were the most active {irci} in the past. In this regard this frequency dictionary is biased, but that's what we have now.



la gleki

Feb 9, 2016, 10:56:53 AM
to lojban
For MultiLing the problem seems to have been solved.
For a somewhat better frequency list, with utterances to/from bots, non-Lojban vocatives and other service information removed, see 

The 1-grams are basically a frequency list.
