With PSM 11, Tesseract struggles with text rotated by 90 degrees, and text that has neighboring non-text graphical elements. PSM 3 gets nicer and tighter text boxes, but then seemingly rejects the "easiest" texts on the sheet.
I am including screenshots to show this.
It isn't clear to me whether OSD is meant for the orientation of the whole page or for the orientation of individual text elements on the page.
For example, I would prefer that it not include the CL symbol, because that got a confidence score of 0 even though it was in fact recognized correctly.
I just don't know how to optimize it with the right config variables.
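One way to experiment with this is to request word-level output and drop low-confidence words after recognition, rather than looking for a single config variable. The sketch below assumes the pytesseract wrapper is installed. The function names `confident_words` and `ocr_sparse` and the threshold of 60 are made up for illustration, not Tesseract settings.

```python
def confident_words(data, min_conf=60.0):
    """Keep non-empty words whose confidence is at least min_conf.

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=Output.DICT). Rows that are not words carry conf -1.
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # conf may arrive as str, int or float
        if text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append({"text": text, "conf": conf, "box": box})
    return words


def ocr_sparse(image, psm=11, min_conf=60.0):
    """Run Tesseract with the given page segmentation mode and filter the result."""
    import pytesseract  # imported here so confident_words works without it

    data = pytesseract.image_to_data(
        image,
        config=f"--psm {psm}",
        output_type=pytesseract.Output.DICT,
    )
    return confident_words(data, min_conf)
```

Calling `ocr_sparse(img, psm=11)` and `ocr_sparse(img, psm=3)` on the same image makes it easy to compare which words each mode keeps. Note that this filter would also drop a correctly recognized symbol that was scored 0.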
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c60cf545-4d52-4333-8790-4f2442fc517fn%40googlegroups.com.
Yeah, it seems page segmentation is the crucial issue. If the bounding boxes are good, the recognition is usually very good. I think I've more or less reached the limit of what I can do with base Tesseract. The next step would be custom training / fine-tuning.
With such clear diagrams, there might be value in having OpenCV remove the horizontal and vertical lines, and then identifying and merging the blobs that are left to get the regions for recognition. I tried this a bit with one of your examples. It would take more refinement, but there might be a path to getting good bounding boxes at the image level.
art