Unnecessary extra space with Japanese.traineddata

Atsuyoshi Suzuki

Jul 23, 2018, 11:31:43 PM

to tesseract-ocr

Hi.

I tried the new tesseract with the traineddata for Japanese (both jpn.traineddata and Japanese.traineddata).

The recognition result with jpn.traineddata is very good.

Japanese.traineddata also gives a good result, but unnecessary spaces are inserted inside words or between characters.

Is this behavior expected? Japanese is written without spaces between words.

If this behavior is expected, what kind of usage is Japanese.traineddata intended for?

jpn.traineddata (very good, and what I expected):

--- start ---

$ tesseract -l jpn test_jpn_04.jpg stdout

Warning. Invalid resolution 0 dpi. Using 70 instead.

Estimating resolution as 168

OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが

できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。

--- end ---

Japanese.traineddata:

--- start ---

$ tesseract -l Japanese test_jpn_04.jpg stdout

Warning. Invalid resolution 0 dpi. Using 70 instead.

Estimating resolution as 168

OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが

できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。

--- end ---

The result is the same on Ubuntu (beta.1) and macOS (4.0.0-beta.2-586-g607e).

Thanks.

test_jpn_04.jpg

Shree Devi Kumar

Jul 24, 2018, 12:44:40 AM

to tesser...@googlegroups.com

Which tessdata repository are you using for your trained data files?

tessdata

tessdata_best

tessdata_fast

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ccfcb61b-3afa-4ecc-b6ac-ae3aebc55465%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Atsuyoshi Suzuki

Jul 24, 2018, 1:10:42 AM

to tesseract-ocr

Hi Shree.

I use tessdata_fast.

On Tuesday, July 24, 2018 at 13:44:40 UTC+9, shree wrote:

Shree Devi Kumar

Jul 24, 2018, 3:28:22 AM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tessdata_fast#example---jpn-and--japanese

for Ray's comment regarding the 'script' traineddata.

preserve_interword_spaces 1

was added via jpn.config to the jpn.traineddata file, and to the other CJK languages, to fix this issue; see https://github.com/tesseract-ocr/tessdata_fast/pull/7

We probably did not make the same change for the script traineddata files.

You can test this by passing the config variable on the command line, adding

-c preserve_interword_spaces 1

(Please check the syntax, it might need a = sign)
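The same variable can also be supplied from a small config file, which mirrors how jpn.config carries it inside jpn.traineddata. The following is only a sketch and is not taken from this thread: the file name keep_spaces.cfg is a placeholder, and it assumes tesseract falls back to reading a trailing config argument as a plain file path when no config of that name exists under tessdata.

```shell
# Hypothetical config file name; the content uses the same "name value"
# format as the preserve_interword_spaces line quoted above.
cfg="./keep_spaces.cfg"
printf 'preserve_interword_spaces 1\n' > "$cfg"

# Run only when tesseract and the sample image are actually available.
if command -v tesseract >/dev/null 2>&1 && [ -f test_jpn_04.jpg ]; then
  # Config files are listed after the output base ("stdout" here).
  tesseract test_jpn_04.jpg stdout -l Japanese "$cfg"
fi
```

A config file is mainly convenient when the same setting is reused across many runs; for a one-off test the -c option is shorter.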

Atsuyoshi Suzuki

Jul 24, 2018, 5:29:07 AM

to tesseract-ocr

Thank you Shree.

With '-c preserve_interword_spaces=1', I get the same result from jpn and Japanese.

$ tesseract -l Japanese -c preserve_interword_spaces=1 test_jpn_04.jpg stdout

The unnecessary space problem is solved. Thanks.
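For anyone who wants to repeat the comparison, here is a minimal sketch that wraps the command above in a shell function and checks whether the two models produce identical text. The function name ocr_keep_spaces and the out_*.txt file names are made up for illustration; only the tesseract options come from the command above.

```shell
# Hypothetical wrapper name; the flags are the ones used in the command above.
ocr_keep_spaces() {
  # $1 = language or script model (e.g. jpn, Japanese), $2 = image path
  tesseract -l "$1" -c preserve_interword_spaces=1 "$2" stdout
}

# Compare both models on the sample image if everything is available.
if command -v tesseract >/dev/null 2>&1 && [ -f test_jpn_04.jpg ]; then
  ocr_keep_spaces jpn test_jpn_04.jpg > out_jpn.txt
  ocr_keep_spaces Japanese test_jpn_04.jpg > out_Japanese.txt
  # cmp -s prints nothing; it reports only through its exit status.
  if cmp -s out_jpn.txt out_Japanese.txt; then
    echo "jpn and Japanese give identical text"
  else
    echo "outputs still differ"
  fi
fi
```

The resolution warnings are not part of the recognized text, so the comparison looks only at what each run writes to standard output.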

On Tuesday, July 24, 2018 at 16:28:22 UTC+9, shree wrote:

Please see https://github.com/tesseract-ocr/tessdata_fast#example---jpn-and--japanese
for Ray's comment regarding the 'script' traineddata.

Does it make sense to assume that Japanese.traineddata is meant for images in which English sentences and Japanese sentences are mixed?

When English words are merely included in Japanese sentences, there seems to be little difference between jpn and Japanese.

mahendrag gajera

Jul 24, 2018, 10:27:12 AM

to tesser...@googlegroups.com

I am using Japanese.traineddata, which gives good results.
