Re: Language identification of languages without clear word boundaries

Alex Ott

May 16, 2020, 11:29:48 AM

to Arbin Timilsina, fastText library

It shouldn't depend on whitespace. The language detection model in
fastText (https://fasttext.cc/blog/2017/10/02/blog-post.html) uses
subword features (character n-grams), so it does not need segmented words.
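To illustrate why subword features make segmentation unnecessary, here is a minimal sketch of character n-gram extraction in the style of fastText's subword scheme (the token is wrapped in `<` and `>` and then sliced into n-grams). The function name `char_ngrams` and the n-gram lengths used here are assumptions chosen for illustration; they are not the settings of the published language identification model.

```python
def char_ngrams(token, minn=2, maxn=4):
    """Return the character n-grams of one whitespace-delimited token.

    Illustrative only: minn/maxn are assumed values, not the ones used
    to train fastText's language identification model.
    """
    wrapped = "<" + token + ">"  # boundary markers around the token
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                grams.append(wrapped[start:start + n])
    return grams


# An unsegmented Chinese sentence is a single "word" after whitespace
# splitting, yet it still produces many overlapping subword features.
print(char_ngrams("我喜欢学习", 2, 3))
```

Even though the whole sentence is treated as one token, the classifier still sees its short character sequences, which is what carries the language signal.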

When I did my evaluation 3 years ago
(http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html),
I didn't do anything special for Chinese/Korean/Japanese texts. I just
took text from the sites and used it as-is. You can check for yourself: the
test data are still available.
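A minimal sketch of passing text as-is to the fastText Python bindings, with no word segmentation. It assumes the `fasttext` package is installed and that the pretrained `lid.176.bin` model has been downloaded to the working directory; the helper names `load_lid_model`, `to_single_line`, and `detect_language` are hypothetical. The only preprocessing is joining lines, because `predict()` handles one line at a time.

```python
def load_lid_model(path="lid.176.bin"):
    """Load the language identification model (path is an assumption)."""
    import fasttext  # imported lazily so the helpers below work without it
    return fasttext.load_model(path)


def to_single_line(text):
    """Join lines with spaces; the text is otherwise left untouched."""
    return " ".join(text.splitlines())


def detect_language(model, text, k=1):
    """Return up to k (language, probability) pairs for the given text."""
    labels, probs = model.predict(to_single_line(text), k=k)
    # Labels come back as "__label__xx"; strip the prefix for readability.
    return [(label.replace("__label__", ""), float(p))
            for label, p in zip(labels, probs)]


# Example usage (requires fasttext and the downloaded model):
#   model = load_lid_model()
#   print(detect_language(model, "我喜欢学习新的语言"))
```

Taking the model as a parameter keeps the loading step separate, so the same helper can be reused across many inputs without reloading the file.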

Arbin Timilsina at "Thu, 9 Apr 2020 07:49:51 -0700 (PDT)" wrote:
AT> I am using fastText for language identification.

AT> The predict() function of FT expects to be given a single line of text and splits words on whitespace. For Chinese or other languages without
AT> clear word boundaries, I am not sure how to inject whitespace. We could use a word segmenter before passing the text to FT, but we would need to
AT> know the language beforehand.

AT> Any suggestions on how to pre-process input text for language identification?

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Arbin Timilsina

May 18, 2020, 9:24:30 AM

to Alex Ott, fastText library

Thanks, Alex.