Making it easier for Tesseract


kali...@googlemail.com

Oct 3, 2015, 1:35:51 PM
to tesseract-ocr
Let's assume we have 600 dpi images of typed text, without much noise, borders, or rotation. The results are good, but not great. So what else could we do?

I would like to...

  1. ... know which file format is best: JPG, PNG, or TIFF? I didn't find documentation on that.
  2. ... give Tesseract more time for better results. Is there an option to do that? (I didn't find one)
  3. ... give tesseract more computation power for better results. (I didn't find an option)
  4. ... see where tesseract is not sure about things, so I can correct them.

It would be great if someone could point me to some documentation (or even offer an opinion) on these questions.

Meh Hem

Oct 5, 2015, 6:19:35 AM
to tesseract-ocr
Hi,

The format does not make a great deal of difference, provided the image quality is good.

There are a number of threads on this list that discuss useful ImageMagick scripts for improving OCR accuracy.

You cannot give tesseract more/less time to increase accuracy, but you can have some of its job done by other programs.

Giving tesseract more resources will make it faster, not more accurate.

You can use TesseractExtractResult() to see where it is going wrong.
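Another way to see where Tesseract is unsure, without touching the API, is its hOCR output, which tags each word with an `x_wconf` confidence value. The following is only a rough sketch: it assumes you produced the file with something like `tesseract page.png page hocr` (the file names are placeholders), and the names `WordConfidenceParser` and `low_confidence_words`, as well as the threshold of 80, are made up for illustration.

```python
import re
from html.parser import HTMLParser


class WordConfidenceParser(HTMLParser):
    """Collect (text, confidence) pairs from ocrx_word spans in hOCR."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, confidence) pairs
        self._conf = None   # confidence of the word currently being read
        self._depth = 0     # tag nesting depth inside the current word span
        self._text = []     # text fragments of the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._conf is not None:
            # Nested markup such as <strong> inside a word span.
            self._depth += 1
            return
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = re.search(r"x_wconf\s+(\d+)", attrs.get("title") or "")
            if match:
                self._conf = int(match.group(1))
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._conf is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.words.append(("".join(self._text), self._conf))
            self._conf = None

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)


def low_confidence_words(hocr, threshold=80):
    """Return the words whose confidence is below the (arbitrary) threshold."""
    parser = WordConfidenceParser()
    parser.feed(hocr)
    parser.close()
    return [(text, conf) for text, conf in parser.words if conf < threshold]
```

Reading the generated `.hocr` file into a string and passing it to `low_confidence_words` gives a list of words worth checking by hand.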

My advice on improving accuracy:
Find which characters are common problems and try to improve them through simple image processing (erode and dilate can make a big difference).
If the font is unusual, you can get more accurate results by training Tesseract on it (this can be time consuming).
Implement a few OCR-oriented ImageMagick scripts.
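To make the erode/dilate suggestion concrete, here is a minimal pure-Python sketch on a binary image stored as a list of rows (1 = ink, 0 = background). It assumes a 3x3 square neighbourhood and ignores pixels outside the image; the function names are illustrative, and in practice you would more likely use ImageMagick or an image library for this step.

```python
def _morph(image, combine):
    """Apply `combine` (min or max) over each pixel's 3x3 neighbourhood."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbours that fall outside the image are simply skipped.
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(combine(neighbours))
        result.append(row)
    return result


def dilate(image):
    """Thicken strokes: a pixel becomes ink if any neighbour is ink."""
    return _morph(image, max)


def erode(image):
    """Thin strokes: a pixel stays ink only if all neighbours are ink."""
    return _morph(image, min)
```

Dilating and then eroding tends to close small gaps inside broken characters, while eroding and then dilating tends to remove isolated specks. Which order helps depends on what the problem characters look like.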

I hope this helps.