Unnecessary extra space with Japanese.traineddata


Atsuyoshi Suzuki

Jul 23, 2018, 11:31:43 PM
to tesseract-ocr
Hi.

I tried the new Tesseract with the traineddata for Japanese (both jpn.traineddata and Japanese.traineddata).

The recognition result with jpn.traineddata is very good.

Japanese.traineddata also gives a good result, but unnecessary spaces are inserted between words or characters.

Is this behavior expected? Japanese does not put spaces between words.

If this behavior is expected, what kind of usage is Japanese.traineddata intended for?

jpn.traineddata (very good, and what I expected):

--- start ---
$ tesseract -l jpn  test_jpn_04.jpg stdout
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 168
OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが
できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。

--- end ---


Japanese.traineddata:

--- start ---
$ tesseract -l Japanese  test_jpn_04.jpg stdout
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 168
OCR 機能 を 提供 する Web API は いく つか 存在 し ます が 、 用 途 に よっ て カス タマ イズ する こと が
で きま せん 。Tesseract は 多数 の 言語 に 対応 し 、Linux、macOS、Windows で 動作 し ます 。

--- end ---


The result is the same on Ubuntu (beta.1) and macOS (4.0.0-beta.2-586-g607e).



Thanks.
test_jpn_04.jpg

Shree Devi Kumar

Jul 24, 2018, 12:44:40 AM
to tesser...@googlegroups.com
Which tessdata repository are you using for your trained data files?

tessdata
tessdata_best
tessdata_fast



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ccfcb61b-3afa-4ecc-b6ac-ae3aebc55465%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Atsuyoshi Suzuki

Jul 24, 2018, 1:10:42 AM
to tesseract-ocr
Hi Shree.

I use tessdata_fast.


On Tuesday, July 24, 2018 at 13:44:40 UTC+9, shree wrote:

Shree Devi Kumar

Jul 24, 2018, 3:28:22 AM
to tesser...@googlegroups.com
for Ray's comment regarding the 'script' traineddata.


preserve_interword_spaces 1

was added via jpn.config to the jpn.traineddata file, and to the other CJK languages, to fix this issue. See https://github.com/tesseract-ocr/tessdata_fast/pull/7

We probably did not make the same change for the script traineddata files.

You can test this by setting the config variable on the command line, adding

-c preserve_interword_spaces 1

(Please check the syntax; it might need an = sign.)
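
On the syntax question, the two spellings belong to two different places: a config file line is written as `name value`, while the `-c` command-line option takes `name=value`. The sketch below shows both side by side. The config file name `cjk_spaces.cfg` is hypothetical, the image name is the one from this thread, and it is an assumption here that a config file in the current directory is picked up when passed as the last argument.

```shell
#!/bin/sh
# Write a one-line config file (the file name is hypothetical).
# Config files use "name value", with no "=" sign.
printf 'preserve_interword_spaces 1\n' > cjk_spaces.cfg

# Run the two variants only if tesseract and the sample image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f test_jpn_04.jpg ]; then
  # 1) Variable on the command line: -c takes name=value.
  tesseract -l Japanese -c preserve_interword_spaces=1 test_jpn_04.jpg stdout
  # 2) Same variable through the config file, passed as the last argument
  #    (assumes a path relative to the current directory is accepted).
  tesseract -l Japanese test_jpn_04.jpg stdout cjk_spaces.cfg
fi
```

If both variants print the same text, the config file is being read; if the second still contains the extra spaces, the file was probably not found.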





Atsuyoshi Suzuki

Jul 24, 2018, 5:29:07 AM
to tesseract-ocr
Thank you, Shree.

With '-c preserve_interword_spaces=1', I get the same result from jpn and Japanese.

$ tesseract -l Japanese -c preserve_interword_spaces=1 test_jpn_04.jpg stdout

The unnecessary space problem is solved. Thanks.
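
To repeat this comparison on other images, a small helper along these lines may be useful. It is only a sketch: `ocr` and `same_output` are hypothetical function names, and `TESSERACT` is a hypothetical override variable that lets the helper be dry-run with a stub instead of the real binary.

```shell
#!/bin/sh
# ocr MODEL IMAGE: run OCR with interword spaces preserved.
# TESSERACT is a hypothetical override; it defaults to the real binary.
ocr() {
  "${TESSERACT:-tesseract}" -l "$1" -c preserve_interword_spaces=1 "$2" stdout
}

# same_output IMAGE: succeed (status 0) when jpn and Japanese produce
# identical text, fail with 1 when they differ, and with 2 if OCR fails.
same_output() {
  a=$(ocr jpn "$1" 2>/dev/null) || return 2
  b=$(ocr Japanese "$1" 2>/dev/null) || return 2
  [ "$a" = "$b" ]
}
```

Usage would be something like `same_output test_jpn_04.jpg && echo "identical"`, assuming both traineddata files are installed.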


On Tuesday, July 24, 2018 at 16:28:22 UTC+9, shree wrote:
for Ray's comment regarding the 'script' traineddata.




Is it reasonable to assume the script traineddata is meant for images where English and Japanese sentences are mixed?

When English words appear inside Japanese sentences, there does not seem to be much difference between jpn and Japanese.

mahendrag gajera

Jul 24, 2018, 10:27:12 AM
to tesser...@googlegroups.com
I am using Japanese.traineddata, which gives good results.
