tesseract-ocr word spacing problem


Chang Alden

Nov 10, 2015, 8:04:35 AM
to tesseract-ocr
I am using the portable tesseract-ocr executable with psm 3 to recognize a rectangular region of the bitmap for each of my images. The problem is that some words come out with incorrect spacing (e.g. "this group" becomes "thisgroup"). I tried resizing the bitmap to a larger size, which fixes this problem, but then other spacing problems appear (e.g. "apple" becomes "appl e"). The words in these examples are not the ones in my test file; due to company policy I cannot reveal those. I think resizing the bitmap might not be the best way to solve the spacing problem. Are there other methods I can try?

Daniel Kraft

Nov 10, 2015, 10:05:34 AM
to tesser...@googlegroups.com
Hi!

On 2015-11-10 14:04, Chang Alden wrote:
> I think resizing the bitmap might not be the best way to solve the
> spacing problem. Is there other methods I can try out?

I'm by no means a tesseract or OCR expert, but I've been experiencing
spacing issues myself (in my case, with columns of numbers).

For me, it worked very well to rotate failing images a few degrees. My
data contains checksums, so that I can determine automatically if a
recognition was correct or not; applying various rotations until it
succeeds works very well to improve my recognition rate significantly
(including for spaces). Not sure if that's an option for you, though.
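A minimal sketch of that retry loop, in case it helps. Everything named here is hypothetical and not taken from my actual setup: `recognize_with_rotations`, the default angle list, and the three callables are placeholders. In practice `rotate` might wrap Pillow's `Image.rotate` and `recognize` might wrap a call to tesseract, while `is_valid` would be whatever checksum test your data allows.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Image = TypeVar("Image")


def recognize_with_rotations(
    image: Image,
    rotate: Callable[[Image, float], Image],
    recognize: Callable[[Image], str],
    is_valid: Callable[[str], bool],
    angles: Iterable[float] = (0, 1, -1, 2, -2, 3, -3),  # degrees; an assumed default
) -> Optional[Tuple[float, str]]:
    """Run OCR at each angle in turn.

    Returns (angle, text) for the first result accepted by is_valid,
    or None if no angle produces an accepted result.
    """
    for angle in angles:
        # Angle 0 means "try the unrotated image first".
        candidate = image if angle == 0 else rotate(image, angle)
        text = recognize(candidate).strip()
        if is_valid(text):
            return angle, text
    return None
```

This only helps when there is an automatic way to tell a good result from a bad one (checksums in my case); without such a check there is no signal for when to stop rotating.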

Yours,
Daniel

--
http://www.domob.eu/
OpenPGP: 1142 850E 6DFF 65BA 63D6 88A8 B249 2AC4 A733 0737
Namecoin: id/domob -> https://nameid.org/?name=domob
--
Done: Arc-Bar-Cav-Hea-Kni-Ran-Rog-Sam-Tou-Val-Wiz
To go: Mon-Pri


Chang Alden

Nov 10, 2015, 10:19:58 AM
to tesseract-ocr
Hi!

Thanks for the reply. I found the hOCR option on the command line, which outputs the coordinates of the words. But which config file in tessdata/configs should I modify in Tesseract 3.02 to get character confidence output? Thanks!

On Tuesday, November 10, 2015 at 11:05:34 PM UTC+8, Daniel Kraft wrote: