How to increase tesseract model accuracy


fady taher

Apr 29, 2019, 6:01:33 AM
to tesseract-ocr
The model keeps outputting (5) instead of (S). I tried fine-tuning, but the process seems to have degraded the whole model. How can I increase the model's accuracy?

Jonathan Muller

Apr 29, 2019, 11:32:14 PM
to tesser...@googlegroups.com
If you know you won't have numbers, what worked for me is blacklisting digits. Otherwise you will have to improve the image quality (for example, by resizing to a larger size and sharpening the edges).
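
For reference, one way to blacklist digits is the tessedit_char_blacklist config variable, passed with -c on the command line. The sketch below only builds the command and does not run it; the file names are placeholders, not from this thread. As far as I know, the LSTM engine in Tesseract 4.0 may ignore character blacklists and whitelists (support returned in 4.1), so the legacy engine (--oem 0, with traineddata that includes it) may be needed on 4.0.

```python
import shlex


def tesseract_blacklist_cmd(image, out_base, blacklist="0123456789", lang="eng"):
    """Build a tesseract command line that forbids the given characters.

    image and out_base are hypothetical file names chosen for illustration.
    """
    return [
        "tesseract", image, out_base,
        "-l", lang,
        # -c sets a config variable for this run only
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]


cmd = tesseract_blacklist_cmd("scan.png", "scan_out")
print(shlex.join(cmd))  # printable form; pass `cmd` to subprocess.run to execute
```

This only helps when the text really contains no digits; otherwise real numbers get forced into letters.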

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4687f9cb-ebc9-443d-bdbb-e9ba50f8014c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Jonathan
06.49.32.74.55

tc...@zips.uakron.edu

May 3, 2019, 3:20:18 PM
to tesseract-ocr
How did you add a blacklist?



fady taher

May 5, 2019, 8:10:39 AM
to tesseract-ocr
I do have numbers, but this character "S" is quite clear. I think it keeps being recognized as "5" because of the surrounding parentheses "(" and ")".



shree

May 5, 2019, 9:17:55 AM
to tesseract-ocr
Share an image for testing.

How did you try to finetune?

fady taher

May 5, 2019, 9:28:47 AM
to tesseract-ocr
I followed the instructions at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00.md#fine-tuning-for-impact and added (S) about 17 times to eng.training_text (attached).
16_0.jpg
eng.training_text

Shree Devi Kumar

May 5, 2019, 10:02:05 AM
to tesser...@googlegroups.com
Which font did you use? Hopefully it was similar to your image. How many iterations?



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

fady taher

May 5, 2019, 10:03:28 AM
to tesseract-ocr
I used the options --fontlist "Calibri" and --max_iterations 3600.

Shree Devi Kumar

May 5, 2019, 10:05:21 AM
to tesser...@googlegroups.com
Try with --max_iterations 400.
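
As a rough sketch of what that looks like, here is the lstmtraining call from the fine-tuning section of the wiki with the iteration limit set to 400. It only builds the command. All paths are hypothetical placeholders; eng.lstm is assumed to have been extracted from eng.traineddata beforehand (the wiki does this with combine_tessdata -e).

```python
import shlex


def finetune_cmd(continue_from, traineddata, train_listfile, model_output,
                 max_iterations=400):
    """Build an lstmtraining command that fine-tunes an existing model.

    A low iteration count keeps the model close to the original weights,
    which is the point of fine-tuning for a small change.
    """
    return [
        "lstmtraining",
        "--continue_from", continue_from,      # LSTM extracted from the base model
        "--traineddata", traineddata,          # original traineddata file
        "--train_listfile", train_listfile,    # list of .lstmf training files
        "--model_output", model_output,        # prefix for checkpoints
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical paths, for illustration only
cmd = finetune_cmd(
    "finetune/eng.lstm",
    "tessdata_best/eng.traineddata",
    "finetune/eng.training_files.txt",
    "finetune/output/calibri",
)
print(shlex.join(cmd))
```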


Shree Devi Kumar

May 5, 2019, 10:19:15 AM
to tesser...@googlegroups.com
The problem seems to be with the JPG image that you are using.

I get correct results when using the PDF file with gImageReader:

Frequency Multipliers:


50 HZ 120 HZ 400 HZ 1 KHZ 10 KHZ 100 KHZ


0.9 1 1 1.15 (1/125 1.25


PHYSICAL DIMENSIONS


Diameter (D): 22 mm + 1mm


Length (L): 25 mm + 2 mm


Lead Spacing (S): 10 mm +/- 0.1 mm

Coating: BrownPET sleeving 


fady taher

May 5, 2019, 10:21:47 AM
to tesseract-ocr
Any recommendations on how to convert a PDF to an image for use with tesseract-ocr, then? Is there a preferred tool for that?

fady taher

May 5, 2019, 10:46:47 AM
to tesseract-ocr
The tool I am currently using is ImageMagick. I tried converting the PDF to an image with another tool, and the result came out correct.



Zdenko Podobny

May 5, 2019, 1:53:53 PM
to tesser...@googlegroups.com
I am not sure which OS you use, but AFAIK ImageMagick uses Ghostscript internally.
After some testing (in another project), I settled on this command for converting PDF to TIFF (Windows version; on other OSes you need to find the correct name of the Ghostscript executable):

gswin64c.exe -dBATCH -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dNOPAUSE -r300 -sDEVICE=tiffgray  -sOutputFile=output.tif input.pdf

Note: I prefer tiffgray over tiffg4, because tiffg4 output is usually ugly (you get a much better result if you convert the color image to gray first and to g4/binary in the next step).
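
A small sketch of the same Ghostscript call with the executable name chosen per OS. It assumes the usual names (gswin64c.exe on 64-bit Windows, gs on Linux and macOS); check what your installation actually provides. The function only builds the argument list.

```python
import platform


def ghostscript_cmd(pdf_path, tiff_path, dpi=300, system=None):
    """Build a Ghostscript command that renders a PDF to grayscale TIFF.

    `system` defaults to the current OS; the executable names are the
    common defaults and may differ on a given machine.
    """
    system = system or platform.system()
    exe = "gswin64c.exe" if system == "Windows" else "gs"
    return [
        exe,
        "-dBATCH", "-dNOPAUSE",          # run non-interactively and exit
        "-dTextAlphaBits=4",             # anti-alias text
        "-dGraphicsAlphaBits=4",         # anti-alias graphics
        f"-r{dpi}",                      # render resolution in DPI
        "-sDEVICE=tiffgray",             # 8-bit grayscale TIFF output
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


print(" ".join(ghostscript_cmd("input.pdf", "output.tif")))
```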

Another option is poppler (quite a pain to get working on Windows, but no problem on Linux). Its pdftoppm utility can produce JPG, PNG, or TIFF output, reduce the color space (gray, mono), specify TIFF compression (none, packbits, jpeg, lzw, deflate)...
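
For illustration, a pdftoppm call along those lines could be assembled like this. The 300 DPI default and the file names are my assumptions, not values from this thread; pdftoppm writes one image per page, named from the output prefix.

```python
def pdftoppm_cmd(pdf_path, out_prefix, dpi=300, gray=True, fmt="tiff"):
    """Build a pdftoppm command that renders PDF pages to images."""
    if fmt not in ("png", "jpeg", "tiff"):
        raise ValueError(f"unsupported output format: {fmt}")
    cmd = ["pdftoppm", "-r", str(dpi)]   # -r sets the resolution in DPI
    if gray:
        cmd.append("-gray")              # grayscale instead of color
    cmd.append(f"-{fmt}")                # output format switch
    cmd += [pdf_path, out_prefix]
    return cmd


print(" ".join(pdftoppm_cmd("input.pdf", "page")))
```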

In my opinion these are the only working open-source, free, multiplatform solutions with reasonable options.

Zdenko

