Thank you, but if we are to tokenize Japanese sentences, we will
probably use MeCab instead, since we already have it integrated to
auto-generate furigana. By the way, I had a quick look at the file
sentences_jp_tokenized.json you committed to that repository. It failed
to parse the first two sentences correctly ("にちょっと" and "何かし"),
so I wonder how reliable that tokenizer is. MeCab is by no means a
state-of-the-art tokenizer, but it has Python bindings if you want to
give it a try.
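
If it helps, here is a minimal sketch of word-level tokenization via the mecab-python3 bindings. The package name and the unidic-lite dictionary are my assumptions; any installed MeCab dictionary should work:

```python
# Minimal sketch, assuming mecab-python3 and a dictionary are installed:
#   pip install mecab-python3 unidic-lite
import MeCab

# "-Owakati" makes MeCab emit surface forms separated by spaces.
tagger = MeCab.Tagger("-Owakati")

sentence = "何かしたいです。"
tokens = tagger.parse(sentence).split()
print(tokens)  # e.g. ['何', 'か', 'し', 'たい', 'です', '。']
```

Dropping the -Owakati flag gives you the full morphological analysis (part of speech, readings, etc.), which is also what you would want for generating furigana.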
I’d love to get proper Japanese tokenization to enhance search
results in Tatoeba. (But even then, that wouldn’t be useful to you
unless we provided those search results over an API.) It’s just that we
have too many things to do and too few hands to help.