Lojban tokenizer for machine learning, first version

Oleg Parashchenko

Jun 12, 2022, 9:31:52 AM
to lojban
Hello everyone,

I've just released the first version of a Lojban tokenizer. It is intended for machine learning applications and therefore differs a bit from a linguistic tokenizer. In particular, it does sub-word tokenization.

Additionally, there is a lexer, which can be used to develop alternative tokenizers.
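
As a rough illustration of that idea, here is a minimal sketch of an alternative, whole-word tokenizer built on lexer-style output. It assumes the lexer yields `(TokenClass, text)` pairs like those printed by `--lex` in the quick start below. The `TokenClass` enum here is a stand-in that mirrors only the two members visible in that output, and `words_only` is a hypothetical helper, not part of the package.

```python
from enum import Enum


class TokenClass(Enum):
    # Stand-in enum: only the two members visible in the --lex output.
    SKIP = 1
    CMAVO = 2


# Pairs copied by hand from the --lex output for "coi ro do".
lexed = [
    (TokenClass.CMAVO, "coi"), (TokenClass.SKIP, " "),
    (TokenClass.CMAVO, "ro"), (TokenClass.SKIP, " "),
    (TokenClass.CMAVO, "do"),
]


def words_only(pairs):
    """Keep the text of every token that is not whitespace/skip material."""
    return [text for cls, text in pairs if cls is not TokenClass.SKIP]


print(words_only(lexed))  # ['coi', 'ro', 'do']
```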

Home page: https://github.com/olpa/lojban-mt/tree/master/tokenizer/

Quick start:

```
$ VERSION=1.0.0
$ pip3 install https://github.com/olpa/lojban-mt/releases/download/tokenizer-v${VERSION}/jbotokenizer-${VERSION}.tar.gz

$ echo 'coirodo' | jboparse.py
coi ro do

$ jboparse.py coi ro do
coi ro do

$ jboparse.py coi ro do --lex
(<TokenClass.CMAVO: 2>, 'coi') (<TokenClass.SKIP: 1>, ' ')
(<TokenClass.CMAVO: 2>, 'ro') (<TokenClass.SKIP: 1>, ' ')
(<TokenClass.CMAVO: 2>, 'do')

$ jboparse.py lojbangirz
logji## bangu## girzu

$ python3
>>> from jbotokenizer import text_to_tokens
>>> text_to_tokens('ma nuzba')
['ma', 'nuzba']
```
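
For downstream use, the sub-word pieces can be regrouped into words. The sketch below assumes, based only on the `lojbangirz` output above, that a trailing `##` marks a piece that continues into the next token. `group_subwords` is a hypothetical helper written for illustration and is not part of the package.

```python
def group_subwords(tokens, marker="##"):
    """Group pieces so that each inner list holds the pieces of one word.

    Assumption: a token ending in `marker` continues into the next token.
    """
    words, current = [], []
    for tok in tokens:
        if tok.endswith(marker):
            # Continuation piece: strip the marker and keep collecting.
            current.append(tok[: -len(marker)])
        else:
            # Final piece of a word: close the current group.
            current.append(tok)
            words.append(current)
            current = []
    if current:  # input ended on a continuation piece
        words.append(current)
    return words


print(group_subwords("logji## bangu## girzu".split()))
# [['logji', 'bangu', 'girzu']]
print(group_subwords("coi ro do".split()))
# [['coi'], ['ro'], ['do']]
```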

Regards,
Oleg

scope845h...@icebubble.org

Jun 14, 2022, 3:33:27 PM
to loj...@googlegroups.com
Oleg Parashchenko <ol...@uucode.com> writes:

> I've just released the first version of a lojban tokenizer. It is intended
> for use in machine learning applications and therefore is a bit different
> from a linguistic tokenizer. In particular, it does sub-word tokenization.
>
> Additionally, there is a lexer, which can be used to develop alternative
> tokenizers.

.uanai How is that different from any of the other Lojban parsers that
have been written? I am interested in your lexer, however. Which
version of the grammar did you use? The PEG? I'd be very curious to
see how your lexer distinguishes between lujvo and fu'ivla.

scope845h...@icebubble.org

Jun 19, 2022, 9:07:11 PM
to loj...@googlegroups.com
Oleg Parashchenko <ol...@uucode.com> writes:

> ```
> $ jboparse.py la .alis. citka le spageti
> la a li s citka le spati## gento i
> ```

LOL. Eating the vegetative Argentinian? That doesn't sound right!

Michael Turniansky

Jun 20, 2022, 7:07:08 AM
to loj...@googlegroups.com
😂
