By corpus, I mean the collection of texts found here http://corpus.lojban.org/
At the top of the page there is a link to download them all as one text file. I took that document, ran a word frequency sorter on it, then filtered out all the non-Lojban words using cmafi'e (available in this Arch package: https://aur.archlinux.org/packages/jbofihe-git/ , thank you zorun).
I had a quick look and only spotted one English word in the first 1000: "kinda". There are also some nonsense words like "tene". Getting rid of these would be much more time-consuming, but if you want me to try something specific, let me know.
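For what it's worth, one rough way to catch stray English words like "kinda" would be to drop any entry that also appears in a system English word list. This is only a sketch: the dictionary path is an assumption, and it is too aggressive as-is, since short cmavo like "do" and "mi" collide with English words and would need a whitelist.

```shell
# Drop frequency lines ("count word") whose word appears in an English
# word list. /usr/share/dict/words is an assumption; adjust for your
# system. Short cmavo ("do", "mi", ...) collide with English words, so
# in practice you would also want a whitelist of known cmavo.
awk 'NR==FNR { dict[tolower($1)]; next } !($2 in dict)' /usr/share/dict/words freq.txt
```

The two-file awk idiom loads the dictionary into an array on the first pass (NR==FNR), then prints only the frequency lines whose second field is not in it.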
btw, the two scripts needed are word_count, which is just this pipeline:

tr -cs "A-Za-z'" '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2

and filter_lojban, which keeps only the lines whose word cmafihe recognises as Lojban (dropping cmene):

#!/bin/bash
while read i; do
    j=$(echo "$i" | sed 's/[0-9]* //' | cmafihe 2> /dev/null | grep -v -e "CMENE")
    if [[ -n $j ]]; then
        echo "$i"
    fi
done < "$1"
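To sanity-check the word_count pipeline, here is a tiny run (the exact left-padding of the counts comes from uniq -c and may vary between systems):

```shell
# A miniature word_count run: split on non-letters, lowercase, count,
# then sort by frequency (descending) and alphabetically within ties.
printf "mi klama .i do klama\n" \
  | tr -cs "A-Za-z'" '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2
# → "2 klama", then "1 do", "1 i", "1 mi" (one per line, counts left-padded)
```

Note that the apostrophe stays in the keep-set, since it is a letter in Lojban (as in cmafi'e).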
And then run them like this, assuming corpus.txt is your source:
word_count < corpus.txt > freq.txt
filter_lojban freq.txt > filtered_freq.txt
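As a quick sanity check on the count-stripping step, a uniq -c style prefix can be removed with an anchored sed pattern (anchoring also eats the leading spaces that uniq adds):

```shell
# Strip the leading spaces and count from a "   42 word" line,
# leaving just the word for cmafihe to analyse.
echo "   42 klama" | sed 's/^ *[0-9]* //'
# prints: klama
```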