Re: Language identification of languages without clear word boundaries


Alex Ott

May 16, 2020, 11:29:48 AM
to Arbin Timilsina, fastText library
It shouldn't depend on whitespace. The language detection model in
fastText (https://fasttext.cc/blog/2017/10/02/blog-post.html) uses
subword (character n-gram) features.
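
To illustrate why subword features do not need segmented input, here is a minimal sketch of character n-gram extraction. It is a simplification, not fastText's actual code: the function names `char_ngrams` and `subword_features` and the `minn`/`maxn` defaults are assumptions chosen for the example, and the real library additionally hashes the n-grams into buckets.

```python
def char_ngrams(token, minn=2, maxn=4):
    """Return the character n-grams of one token, wrapped in < and > markers."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of n characters over the wrapped token.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def subword_features(line, minn=2, maxn=4):
    """Split on whitespace, then collect n-grams from every token.

    A line with no spaces is simply treated as one long token, so it
    still yields a full set of character n-grams.
    """
    features = []
    for token in line.split():
        features.extend(char_ngrams(token, minn, maxn))
    return features


if __name__ == "__main__":
    # Unsegmented Chinese text: one token, several bigrams.
    print(subword_features("我喜欢猫", minn=2, maxn=2))
```

An unsegmented sentence still produces overlapping character sequences, which is the signal a classifier of this kind would work from.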

When I did my evaluation three years ago
(http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html),
I didn't do anything special for Chinese/Korean/Japanese texts. I just
took text from the sites and used it as-is. You can check for yourself: the
test data is still available.
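
Passing text as-is could look roughly like the sketch below. The helper name `detect_language` is hypothetical, and it assumes a model object whose `predict(text, k=1)` returns a pair of labels and probabilities with labels prefixed by `__label__`, which is how the fastText Python bindings behave. The only pre-processing is folding newlines into spaces, since `predict()` expects a single line.

```python
def detect_language(model, text, label_prefix="__label__"):
    """Return (language_code, probability) for raw, unsegmented text.

    `model` is assumed to expose predict(text, k=1) -> (labels, probs).
    No word segmentation is applied; the text is passed through as-is.
    """
    # Collapse newlines and repeated whitespace into single spaces,
    # because predict() handles one line of text at a time.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=1)
    label = labels[0]
    # Strip the label prefix if present, leaving just the language code.
    if label.startswith(label_prefix):
        label = label[len(label_prefix):]
    return label, float(probs[0])
```

With the `fasttext` package installed and the language identification model downloaded (assumed here to be saved locally as `lid.176.bin`), this would be called as `detect_language(fasttext.load_model("lid.176.bin"), text)`.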


Arbin Timilsina at "Thu, 9 Apr 2020 07:49:51 -0700 (PDT)" wrote:
AT> I am using fastText for language identification.

AT> The predict() function of FT expects to be given a single line of text and splits words on whitespace. For Chinese or other languages without
AT> clear word boundaries, I am not sure how to inject whitespace. We could use a word segmenter before passing the text to FT, but we would need to
AT> know the language beforehand.

AT> Any suggestions on how to pre-process input text for language identification?



--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Arbin Timilsina

May 18, 2020, 9:24:30 AM
to Alex Ott, fastText library
Thanks, Alex.
