Tesseract 4.0 Korean detecting Chinese

Fanatico

Apr 9, 2018, 1:39:53 AM

to tesseract-ocr

I'm running tesseract with the "-l kor" parameter, but the output contains Chinese characters. The image really does have 3 Chinese characters, and none of them is returned correctly (I'm not expecting them to be), but some of the Korean characters are also being recognized as Chinese characters.

tesseract teste_kor.tif teste_kor -l kor --oem 3 --psm 6

Any idea how to fix it?

Result:

1 화

서 05)

수 마 0 뜨 \) 에 사 로 잡혀 눈 을 도 저

히 뜰 수가 없다.

힘 을 내 도 겨우 반 개 하는 것이 고

작 . 그 이상 움직일 수가 없었다.

" 아 ‥…. 7

苗朮習趾葉刁估舍點選們同對刀

려 소 리 를 낸다. 하지만 신 음 에 가

까운 목 소 리 만 홀 러 나 올 뿐이었다.

“장로 Q 全程 ::: 가 시 면 ‥.”

ShreeDevi Kumar

Apr 9, 2018, 1:55:20 AM

to tesser...@googlegroups.com

Which traineddata are you using?

Use combine_tessdata to extract the config file and see whether Chinese is included as a sub-language.

Also look at the lstm-unicharset to see if the Chinese characters are included in it.
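
The two checks above could be sketched roughly as follows. This assumes kor.traineddata is in the current directory; the function names extract_components and show_sublangs are made up for illustration and are not part of Tesseract.

```shell
# Extract the config and the LSTM unicharset from a traineddata file.
# $1 is the language code, e.g. "kor" (assumes $1.traineddata is in the current directory).
extract_components() {
  combine_tessdata -e "$1.traineddata" "$1.config" "$1.lstm-unicharset"
}

# Print any sub-language setting found in an extracted config file ($1).
show_sublangs() {
  grep -E '^[[:space:]]*tessedit_load_sublangs' "$1"
}

# Usage (needs the Tesseract training tools installed):
#   extract_components kor
#   show_sublangs kor.config
```

If show_sublangs prints nothing, the config does not load any sub-language.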

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1e5142e1-d198-46d3-95ee-1a3206d1a2c4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fanatico

Apr 9, 2018, 2:18:50 AM

to tesseract-ocr

I used a traineddata that I created by removing the top layer from the kor.traineddata in "tessdata_best". After that I replaced it with the original one from "tessdata_best" and got the same problem.

Yes, it includes chi_tra as a sublanguage:

tessedit_load_sublangs chi_tra

lstm-unicharset only has Korean characters

ShreeDevi Kumar

Apr 9, 2018, 2:24:44 AM

to tesser...@googlegroups.com

Please remove the sub-language line from the config file, and use combine_tessdata to overwrite it.

Right now it seems to be using chi_tra also.
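
A minimal sketch of that edit, assuming kor.config was already extracted with combine_tessdata -e and that kor.traineddata is writable. The helper names strip_sublangs and overwrite_config are hypothetical.

```shell
# Write a copy of a config file ($1) to $2 with every tessedit_load_sublangs line removed.
strip_sublangs() {
  sed '/^[[:space:]]*tessedit_load_sublangs/d' "$1" > "$2"
}

# Replace the config component inside $1.traineddata with the edited file.
# combine_tessdata -o identifies the component by file suffix, so the file must end in ".config".
overwrite_config() {
  combine_tessdata -o "$1.traineddata" "$1.config"
}

# Usage (back up kor.traineddata first):
#   strip_sublangs kor.config kor.config.new && mv kor.config.new kor.config
#   overwrite_config kor
```

Writing to a second file and renaming it avoids truncating the config while sed is still reading it.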


Fanatico

Apr 9, 2018, 3:22:11 AM

to tesseract-ocr

It worked, thanks.

Any reason for chi_tra being there?

ShreeDevi Kumar

Apr 9, 2018, 4:15:57 AM

to tesser...@googlegroups.com

Leftover from 3.04, my guess.


ShreeDevi Kumar

Apr 9, 2018, 5:45:10 AM

to tesser...@googlegroups.com

For Korean, please check whether adding the following lines to the config improves your results further.

#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009

preserve_interword_spaces 1

ShreeDevi
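
One way to apply the suggested setting is sketched below, assuming an extracted kor.config in the current directory. The helper add_config_line is hypothetical; it appends a setting only if the file does not already set it.

```shell
# Append "name value" to a config file ($1) unless a line setting that name ($2) is already there.
add_config_line() {
  grep -q "^[[:space:]]*$2[[:space:]]" "$1" || printf '%s %s\n' "$2" "$3" >> "$1"
}

# Usage:
#   add_config_line kor.config preserve_interword_spaces 1
#   combine_tessdata -o kor.traineddata kor.config
#
# To try the setting without editing the traineddata, pass it on the command line:
#   tesseract teste_kor.tif teste_kor -l kor --oem 3 --psm 6 -c preserve_interword_spaces=1
```

The -c form is convenient for a quick comparison of the output with and without the setting before changing the traineddata.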


Fanatico

Apr 9, 2018, 9:32:52 AM

to tesseract-ocr

The config from kor already had it.

Fanatico

Apr 11, 2018, 3:36:24 PM

to tesseract-ocr

After some research on Korean I found that Chinese characters are used in the language, so it is correct to set Chinese as a sublanguage. The problem is that kor.training_text doesn't have Chinese characters, so training covers only Korean and ignores the Chinese. As a result, if I run tesseract on an image that has both Korean and Chinese, it recognizes some Korean characters as Chinese and some Chinese characters as Korean.
