What can be done to improve the accuracy of extraction

Prav

Jul 15, 2017, 2:37:53 AM

to tesseract-ocr

Hi,

I have a good-quality image from which text has to be extracted, but the output contains a lot of junk characters.

The image is attached.

The corresponding hOCR output is also attached.

Any ideas on how the quality of the output can be improved? There is a lot of junk coming out for a good-quality image.

70tc.jpg

70.hocr

Tom Morris

Jul 15, 2017, 5:01:22 PM

to tesseract-ocr

That doesn't look too bad for a completely untuned result.

The first thing you probably want to do is line removal, and perhaps some segmentation of the form image. There are many, many threads in the archive concerning this, so I'm sure you'll find some good tips.
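As a rough illustration of the line-removal idea (a minimal sketch, not something taken from the archive threads): on a binarized form image, ruling lines show up as very long unbroken runs of ink pixels, while characters produce only short runs. The helper below, with the hypothetical name `remove_long_runs` and the hypothetical threshold `min_run`, erases any horizontal or vertical run at least that long. It assumes the image has already been binarized into rows of 0 (background) and 1 (ink), and it will also erase the pixels where text touches a line, so treat it as a starting point only. In practice this step is usually done with an image-processing library's morphological operations instead.

```python
def remove_long_runs(image, min_run):
    """Return a copy of a binary image with long straight ink runs erased.

    image:   list of equal-length rows; 1 = ink, 0 = background (assumed format).
    min_run: runs of at least this many consecutive ink pixels are treated
             as ruling lines (hypothetical tuning parameter).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    to_clear = [[False] * width for _ in range(height)]

    # Mark long horizontal runs, scanning each row of the original image.
    for y in range(height):
        x = 0
        while x < width:
            if image[y][x]:
                start = x
                while x < width and image[y][x]:
                    x += 1
                if x - start >= min_run:
                    for i in range(start, x):
                        to_clear[y][i] = True
            else:
                x += 1

    # Mark long vertical runs, scanning each column of the original image.
    for x in range(width):
        y = 0
        while y < height:
            if image[y][x]:
                start = y
                while y < height and image[y][x]:
                    y += 1
                if y - start >= min_run:
                    for j in range(start, y):
                        to_clear[j][x] = True
            else:
                y += 1

    # Build the result without modifying the input image.
    return [
        [0 if to_clear[y][x] else image[y][x] for x in range(width)]
        for y in range(height)
    ]


# Tiny example: row 2 is a full-width ruling line, the rest is "text".
form = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 0],
]
cleaned = remove_long_runs(form, min_run=6)
```

Both scans read the original image rather than the partially cleaned one, so erasing a horizontal line does not break up a vertical line that crosses it. A sensible `min_run` would be somewhat larger than the widest character in the scan.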

Tom

srn...@gmail.com

Jul 17, 2017, 6:20:20 AM

to tesseract-ocr

Hello Tom,

So, if training is considered for a particular case like this, should I use one training text rendered in a lot of fonts, or a large amount of training text (many copies of the training text, one after another in a single file) with only one font?
Which would be the ideal choice?

Can you please explain briefly?

Thanks.
