Problems with training tesseract


Cristobal Jesus Muñoz Solano

Aug 2, 2019, 11:56:25 AM
to tesseract-ocr
Hello, I am trying to use Tesseract. I have read all the documentation and run many tests. Sorry if this is not the place to ask, but I have been researching for several days, I still have many doubts, and I do not know what to do or where to look. I'm frustrated.

1) If I want to train Tesseract to improve its accuracy on images in the OCR-B font, should I first fine-tune by adding the OCR-B font? Or can I create a .traineddata file directly from the image/box pairs and then combine it with the best model?

2) How do I add many image/box pairs to the best model?

3) Once a .traineddata file is ready and saved in tessdata, is that enough for Tesseract to use it when I run it on my images?

I already tried this script:
https://github.com/Shreeshrii/tessdata_ocrb

but I still don't understand how to add new training images to the best model.

please help me, I don't want to kill myself so young

Shree Devi Kumar

Aug 2, 2019, 12:07:34 PM
to tesseract-ocr

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b63523ed-0e81-483b-a224-ada4c786fa3d%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ElGato ElMago

Aug 4, 2019, 9:23:47 PM
to tesseract-ocr
Did you specify the language option for ocrb when you ran the recognition?
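Placing a .traineddata file in tessdata is not enough on its own: Tesseract only uses it when the language is selected with `-l`. The following is a minimal sketch of such a call, assuming the model file is named `ocrb.traineddata` (the name, the default page segmentation mode and the helper function names are illustrative assumptions, not something stated in the thread).

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="ocrb", tessdata_dir=None, psm=6):
    """Build a tesseract CLI call that selects a specific traineddata file.

    `lang` must match the file name without the extension, so lang="ocrb"
    expects ocrb.traineddata (an assumed name) in the tessdata directory.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Point Tesseract at a non-default folder holding the traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run the command and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_ocr("sample.png", lang="mrz", tessdata_dir="./tessdata")` would use `./tessdata/mrz.traineddata` if that file exists; without `-l`, Tesseract falls back to `eng`.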

Cristobal Jesus Muñoz Solano

Aug 7, 2019, 10:10:44 AM
to tesseract-ocr
Hello, I have already tried mrz.traineddata. It is quite good, but it is not accurate enough. How can I improve it? Is it possible to use box/PNG files to improve its accuracy?



Mamadou

Aug 7, 2019, 10:40:40 AM
to tesseract-ocr


On Wednesday, August 7, 2019 at 4:10:44 PM UTC+2, Cristobal Jesus Muñoz Solano wrote:
Hello, I have already tried mrz.traineddata. It is quite good, but it is not accurate enough. How can I improve it? Is it possible to use box/PNG files to improve its accuracy?

mrz.traineddata was generated using thousands of images. I doubt you'll be able to increase the accuracy just by adding more data.

Most of the time, accuracy issues are related to poor image pre-processing.

You can try https://www.doubango.org/webapps/mrz/, which uses mrz.traineddata, with the failing images to see if it works. If it does, the issue is in the pre-processing.

If you share some sample images it would be easier to help you.

Cristobal Jesus Muñoz Solano

Aug 7, 2019, 3:58:24 PM
to tesseract-ocr
I can already generate the .box files from PNG images using listbox, but I don't understand the next step.
How can I use them to improve the best model, eng.traineddata?
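For reference, fine-tuning a best model from image/box pairs with the Tesseract 4 LSTM tools usually follows four steps: extract the LSTM network from the base model, turn each image/box pair into an `.lstmf` file, continue training from the extracted network, and convert the final checkpoint into a .traineddata file. The sketch below only assembles the commands so they can be inspected before running. All directory names, file stems, the page segmentation mode and the iteration count are hypothetical, and it assumes the .box files are in a format that the `lstm.train` config accepts, which the thread does not confirm for boxes produced by listbox.

```python
import shlex


def finetune_commands(base="eng", new="ocrb_ft", best_dir="tessdata_best",
                      work_dir="train", image_stems=("line_0001",),
                      psm=6, max_iterations=400):
    """Return the CLI calls for a fine-tuning run (paths are placeholders)."""
    base_td = f"{best_dir}/{base}.traineddata"
    # 1) Extract the LSTM network from the best (float) model.
    cmds = [["combine_tessdata", "-e", base_td, f"{work_dir}/{base}.lstm"]]
    # 2) Make one .lstmf per image; the matching .box must sit next to it.
    for stem in image_stems:
        cmds.append(["tesseract", f"{work_dir}/{stem}.png",
                     f"{work_dir}/{stem}", "--psm", str(psm), "lstm.train"])
    # 3) Continue training; list.txt holds one .lstmf path per line.
    cmds.append(["lstmtraining",
                 "--continue_from", f"{work_dir}/{base}.lstm",
                 "--traineddata", base_td,
                 "--train_listfile", f"{work_dir}/list.txt",
                 "--model_output", f"{work_dir}/{new}",
                 "--max_iterations", str(max_iterations)])
    # 4) Convert the last checkpoint into a usable .traineddata file.
    cmds.append(["lstmtraining", "--stop_training",
                 "--continue_from", f"{work_dir}/{new}_checkpoint",
                 "--traineddata", base_td,
                 "--model_output", f"{work_dir}/{new}.traineddata"])
    return cmds


if __name__ == "__main__":
    for cmd in finetune_commands():
        print(shlex.join(cmd))
```

The resulting file would then be copied into tessdata and selected with `-l ocrb_ft`. This only works with models from tessdata_best, since the integer models in tessdata and tessdata_fast cannot be used as a starting point for training.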

Cristobal Jesus Muñoz Solano

Aug 7, 2019, 4:08:28 PM
to tesseract-ocr
This image returns L2007190588S37<<<<<<<<<<<<\n77F1912157PER22344783<K<3\n<RODRIGUEZ<<LORENZA<SO
but it works well at https://www.doubango.org/webapps/mrz/. I don't know why :(
Selección_001.png

Bossiel

Aug 7, 2019, 4:36:13 PM
to tesser...@googlegroups.com


This is what I explained in my previous post. The issue isn't the trained model but the preprocessing. The MRZ lines are skewed and the contrast isn't uniform, so fine-tuning the model won't help. Tesseract uses Otsu binarization; dump the binarized result to see how it handles your image.
It works with doubango's online demo because the lines are de-skewed, de-slanted and binarized using TMMS and Wolf. See https://groups.google.com/forum/m/#!topic/doubango-ai/WxMG4F3X5XQ.
That said, add a few pixels of margin around the image to make sure the first letters are captured.
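To make the binarization point concrete, here is a small pure-Python sketch of Otsu's method: it picks the single global threshold that maximizes the between-class variance of the gray-level histogram. It is an illustration of the technique only, not Tesseract's actual implementation, and the function names are made up for this example. Because one threshold is applied to the whole image, uneven contrast can push faint characters to the wrong side of it.

```python
def otsu_threshold(histogram):
    """Return the gray level t that best splits pixels into <= t and > t."""
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_bg = 0      # number of pixels at or below the candidate level
    sum_bg = 0.0       # sum of gray levels of those pixels
    best_t, best_var = 0, -1.0
    for t, count in enumerate(histogram):
        weight_bg += count
        if weight_bg == 0:
            continue               # no pixels this dark yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break                  # every pixel is already in the dark class
        sum_bg += t * count
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, levels=256):
    """Binarize a flat list of gray values (0..levels-1) with Otsu's threshold."""
    histogram = [0] * levels
    for p in pixels:
        histogram[p] += 1
    t = otsu_threshold(histogram)
    return [255 if p > t else 0 for p in pixels]
```

To inspect what Tesseract itself produces, the `tessedit_write_images` config variable (for example `-c tessedit_write_images=true`) should write the thresholded input image to disk in Tesseract 4; treat the exact output file name as version dependent.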


Bossiel

Aug 7, 2019, 4:48:47 PM
to tesser...@googlegroups.com
Also, check the training data to see how much white space there is at the beginning and end of the lines. Try adding the same amount of space before passing the image to Tesseract.
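Adding a margin can be sketched as follows, treating the image as a list of rows of gray values with white as 255. The function name and the margin sizes are illustrative assumptions; the right amount should be taken from the training images, as suggested above.

```python
def pad_image(rows, left=0, right=0, top=0, bottom=0, fill=255):
    """Return a copy of a grayscale image (list of rows) with a solid border."""
    width = len(rows[0]) if rows else 0
    new_width = left + width + right
    # Blank rows above the text.
    padded = [[fill] * new_width for _ in range(top)]
    # Original rows with blank columns on each side.
    for row in rows:
        padded.append([fill] * left + list(row) + [fill] * right)
    # Blank rows below the text.
    padded.extend([fill] * new_width for _ in range(bottom))
    return padded
```

With Pillow installed, `ImageOps.expand(image, border=10, fill=255)` gives a uniform border on a real grayscale image file in one call.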
