Problem when combining a custom-trained model with the default Tesseract 4 model

Rujrawee K

Oct 1, 2018, 5:33:04 AM

to tesseract-ocr

Hi,

I trained a custom Thai language model for Tesseract 4. It works fine (leaving accuracy aside), but it cannot read English because English was not included in the model. So I tried combining my custom Thai model with the default English model using "-l custom_tha+eng". The output shows that Tesseract still cannot read the English text. When I swap the order to "-l eng+custom_tha", it reads the English text but not the Thai text. It looks as if Tesseract uses only one model to read the text. With the default tha and eng models from Tesseract 4, however, both languages work fine.
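
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the language orderings could be run side by side through pytesseract. The helper names (build_config, ocr_with_orderings) are made up for illustration, "custom_tha" is the model name used in this post, and the --oem/--psm values are assumptions rather than required settings.

```python
from typing import Dict, List


def build_config(languages: List[str], oem: int = 1, psm: int = 3) -> str:
    """Join language names with '+' and build a Tesseract config string."""
    return f"-l {'+'.join(languages)} --oem {oem} --psm {psm}"


def ocr_with_orderings(image, orderings: List[List[str]]) -> Dict[str, str]:
    """Run OCR once per language ordering and return {config: recognized text}."""
    # Imported lazily so build_config can be used without Tesseract installed.
    import pytesseract

    results = {}
    for languages in orderings:
        config = build_config(languages)
        results[config] = pytesseract.image_to_string(image, config=config)
    return results


# The orderings described in this post.
ORDERINGS = [["custom_tha", "eng"], ["eng", "custom_tha"], ["tha", "eng"]]

for languages in ORDERINGS:
    print(build_config(languages))
```

Calling ocr_with_orderings(image, ORDERINGS) on the same image should make it easy to see whether the output changes with the order of the languages.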

My question is: why does this happen, and is there a solution or suggestion for this problem?

Regards

Shree Devi Kumar

Oct 1, 2018, 9:26:48 PM

to tesser...@googlegroups.com

Have you tried

https://github.com/tesseract-ocr/tessdata_fast/blob/master/script/Thai.traineddata

which is supposed to support both Thai and English?
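
A rough sketch of how that script model might be selected from pytesseract, assuming Thai.traineddata has been saved in a "script" subfolder of a tessdata directory. The directory path and the helper name script_model_config are placeholders for illustration, and the --oem/--psm values are assumptions.

```python
def script_model_config(tessdata_dir: str, model: str = "script/Thai") -> str:
    """Build a config string that points Tesseract at a script-level model.

    tessdata_dir is expected to contain script/Thai.traineddata.
    """
    return f'--tessdata-dir "{tessdata_dir}" -l {model} --oem 1 --psm 3'


# Placeholder path; replace with the real location of the tessdata folder.
config = script_model_config("/path/to/tessdata_fast")
print(config)

# With Tesseract and pytesseract installed, the config could then be used as:
#   import pytesseract
#   text = pytesseract.image_to_string(image, config=config)
```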

Rujrawee K

Oct 1, 2018, 11:08:48 PM

to tesseract-ocr

Hi Shree,

Yes, we tried that and it works OK. My problem is that when I train a new Thai model and use it together with the default eng model from Tesseract 4, as in "-l custom_tha+eng", it only reads the language that comes first in the command, in this case "custom_tha". The same happens with "-l eng+custom_tha": it only reads "eng". With the default models for both languages from Tesseract 4, it can read both languages at the same time without a problem, apart from the accuracy. Did I miss something?

Shree Devi Kumar

Oct 1, 2018, 11:14:11 PM

to tesser...@googlegroups.com

1. Have you trained for legacy tesseract engine or for LSTM?

2. Which default traineddata are you using?

3. For us to test, please provide an image and the commands used for testing and the output you got.

Rujrawee K

Oct 2, 2018, 1:01:55 AM

to tesseract-ocr

OK, Shree, I miscommunicated with my colleague. He says this problem occurs with both the default and the custom-trained models. That is, whichever models are used, if each one was trained on a single language (with no other language in the training process) and they are combined with "-l", a line that contains both languages is read in only one language, while a line that contains a single language is read fine (please see the result below for a clearer explanation).

My answers are below:

1. We trained for LSTM.
2. We used "tessdata_best".
3. The code is shown below:

import cv2
import pytesseract

# img_path_name: path to the input image (not given in this post)
config_name = '-l eng+tha --oem 1 --psm 3 -c preserve_interword_spaces=1'
im_name = cv2.imread(img_path_name, cv2.IMREAD_COLOR)
text_name = pytesseract.image_to_string(im_name, config=config_name)
print(text_name)

The result is:

As you can see, if the input image has both languages (English and Thai) on the same line, only one language is read, but a line that contains a single language is read in the correct language. These results are from the default models (the custom model gives the same result).
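
To check this on the OCR text itself rather than by eye, a small script could report which scripts show up on each output line. This is only a sketch: the function name scripts_per_line is made up, Thai is detected through the Unicode Thai block (U+0E00 to U+0E7F), and English is approximated by plain ASCII letters.

```python
import re
from typing import List, Set, Tuple

THAI = re.compile(r"[\u0E00-\u0E7F]")  # Unicode Thai block
LATIN = re.compile(r"[A-Za-z]")        # ASCII letters as a stand-in for English


def scripts_per_line(text: str) -> List[Tuple[str, Set[str]]]:
    """Return (line, scripts found) for every non-empty line of OCR output."""
    report = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        scripts = set()
        if THAI.search(line):
            scripts.add("thai")
        if LATIN.search(line):
            scripts.add("latin")
        report.append((line, scripts))
    return report


# Hand-written sample text, not real Tesseract output.
sample = "hello\nสวัสดี\nhello สวัสดี\n"
for line, scripts in scripts_per_line(sample):
    print(sorted(scripts), line)
```

If a line in the image mixes both languages but the report for that line lists only one script, that matches the behavior described above.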

Shree Devi Kumar

Oct 2, 2018, 9:53:06 AM

to tesser...@googlegroups.com

There is an open issue about a similar problem in the issue tracker. It would help to move the discussion there.

I will test with your sample image and also post a link to the issue.

Shree Devi Kumar

Oct 2, 2018, 1:21:27 PM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/issues/1579

and continue further discussion there.

Rujrawee K

Oct 3, 2018, 2:28:27 AM

to tesseract-ocr

Thank you, Shree. I will let my colleague know and continue the discussion there.
