It shouldn't depend on whitespace. The language detection model in
fastText (
https://fasttext.cc/blog/2017/10/02/blog-post.html) uses
subword features (character n-grams), so it still has something to work
with when words aren't separated by spaces.
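
To illustrate why subword features don't need word boundaries, here is a
minimal sketch of character n-gram extraction. It is a simplification, not
fastText's actual implementation (which also hashes the n-grams into
buckets); the helper name char_ngrams and the minn/maxn values are
assumptions chosen for illustration, not the settings of the language
identification model.

```python
def char_ngrams(token, minn=2, maxn=4):
    """Return character n-grams of a token wrapped in '<' and '>' boundary markers.

    Simplified illustration of subword features; minn/maxn are illustrative values.
    """
    wrapped = f"<{token}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped token.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# A Chinese sentence without spaces stays a single "word" after whitespace
# splitting, but it still decomposes into many character n-grams.
text = "我喜欢自然语言处理"
tokens = text.split()
bigrams = char_ngrams(tokens[0], minn=2, maxn=2)
print(len(tokens), bigrams)
```

So even when whitespace splitting yields one long token, the model still
sees plenty of short character sequences that are typical for the language.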
When I did my evaluation 3 years ago
(
http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html),
I didn't do anything special for Chinese/Korean/Japanese texts - I just
took text from the sites and used it as-is. You can check for yourself - the
test data is still available.
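
Passing the text as-is could look roughly like the sketch below. It assumes
the fasttext Python package and the pre-trained lid.176.bin model downloaded
to the current directory; the helper names detect_language and
load_lid_model are made up for this example. The only pre-processing is
collapsing newlines, because predict() works on a single line.

```python
def detect_language(model, text, k=1):
    """Predict the language of raw text without any word segmentation."""
    # predict() expects a single line, so collapse newlines and whitespace runs.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=k)
    # Labels come back as "__label__xx"; strip the prefix to get the language code.
    return [(label.replace("__label__", ""), float(p))
            for label, p in zip(labels, probs)]


def load_lid_model(path="lid.176.bin"):
    """Load the language identification model (the path is an assumption)."""
    import fasttext  # imported lazily so detect_language can be tried with a stub
    return fasttext.load_model(path)
```

With the model file in place, detect_language(load_lid_model(), some_text)
returns a list of (language code, probability) pairs, without any word
segmenter in front of it.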
Arbin Timilsina at "Thu, 9 Apr 2020 07:49:51 -0700 (PDT)" wrote:
AT> I am using fastText for language identification.
AT> The predict() function of FT expects to be given a single line of text and splits words on whitespace. For Chinese or other languages without
AT> clear word boundaries, I am not sure how to inject whitespace. We could use word segmenter before passing the text to FT- but we will need to
AT> know the language beforehand.
AT> Any suggestions on how to pre-process input text for language identification?
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)