Tesseract seems to be removing correctly segmented and oriented blocks for the final classification

Utkarsh Sinha

Dec 22, 2015, 2:04:26 AM
to tesseract-ocr
Hello!

I'm trying to find out why Tesseract is rejecting certain blobs from the image here. The text "nestle" and "nesquik" have overlapping baselines. I suspect the overlap might be causing it to stop recognizing anything at all.


I've also included some debug images I captured from tesseract. From the images, it seems like Tesseract can correctly identify the blobs for each character and finds the correct baselines. However, it chops the Nesquik from the image based on the "nestle".


Are there any tesseract config options I can use to get this working? 


Thanks!



Tom Morris

Dec 22, 2015, 4:51:14 PM
to tesseract-ocr
On Tuesday, December 22, 2015 at 2:04:26 AM UTC-5, Utkarsh Sinha wrote:
I'm trying to find out why Tesseract is rejecting certain blobs from the image here. The text "nestle" and "nesquik" have overlapping baselines. I suspect the overlap might be causing it to stop recognizing anything at all.

They're not only overlapping, but they are at something like a 30-degree angle to each other. It doesn't surprise me that Tess considers that an unreasonable amount of interline skew. Where would one see that in a normal text layout? Additionally, the "Nesquik" isn't really text, but a stylized logotype.

Perhaps consider using SIFT/SURF/etc detectors from OpenCV?

Tom

Utkarsh Sinha

Dec 22, 2015, 11:41:58 PM
to tesseract-ocr
Tom, we did set "Force parallel baselines" to false. I was hoping that would keep Tesseract from discarding Nesquik. Are there any other parameters I can try tweaking?
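
For anyone trying to reproduce this, here is a rough sketch of how such an override can be passed to the command-line tool with `-c name=value`. It assumes that "Force parallel baselines" is the description of the `textord_parallel_baselines` parameter (worth confirming against `tesseract --print-parameters`); the file names `nesquik.png` and `out` and the helper `build_tesseract_command` are made up for illustration. The sketch only builds the command and does not run Tesseract.

```python
import shlex


def build_tesseract_command(image_path, output_base, params):
    """Return an argv list for the tesseract CLI with `-c name=value` overrides."""
    command = ["tesseract", image_path, output_base]
    for name, value in params.items():
        # Write Python booleans as 1/0 so they read as Tesseract boolean values.
        if isinstance(value, bool):
            value = int(value)
        command += ["-c", f"{name}={value}"]
    return command


# Assumption: this is the parameter behind "Force parallel baselines".
# The image and output names are placeholders.
command = build_tesseract_command(
    "nesquik.png", "out", {"textord_parallel_baselines": False}
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where Tesseract is installed. Whether this override is enough to keep the skewed line is a separate question, as the exchange above suggests.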

While SIFT/SURF/etc. are definitely options, I'm currently exploring OCR and its limits. Given enough training, SIFT/etc. would work just fine. However, we would first have to gather a lot of data, which isn't possible in our case. The data I'm working with hits us first and only later becomes popular and available through Google Images. So scraping the internet might not be of much help to us.