hOCR verification and editing plus non-word characters

Misti Hamon

Mar 24, 2024, 2:29:11 PM

to tesseract-ocr

I'll preface this by saying that I haven't actually done an OCR run yet (by the time any replies come in I probably will have; the source image editing is almost done).

I'm working with some photoscanned images of knitting-related work, so there are some non-word characters and acronyms. Most of the text is still English, but there are occasional symbols: some standard ASCII or Unicode, others specialty (I should be able to exclude the specialty symbols and keep them as images, or at least I hope so). Since tesseract's recognition is based on "groups of words", it sounds like this might produce unexpected results? An example of a line that could cause a problem: "K2, yo, k2tog, k to last 4, ssk, yo, k2". It doesn't look like English words, but it kind of looks like a sentence *if* you assume that whatever comes before a space or comma is a word.

So, to handle or fix things like that without training, I'm looking for tips on how to inspect my hOCR files to verify and, if necessary, correct the results, using tools that work on Linux without running Wine. I am looking into the tools suggested in the "Post OCR Verification and Editing" conversation, but that poster is on Windows with a different toolchain, so I'm not sure they all apply to me.
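
As a starting point for inspecting hOCR on Linux without any GUI tool, the sketch below lists the words tesseract itself was least sure about. It assumes the hOCR follows the usual layout, where each word is a `span` with class `ocrx_word` and a `title` attribute carrying `bbox` and `x_wconf`. The names `WordCollector` and `low_confidence_words`, and the threshold of 80, are illustrative choices and not part of any tool mentioned in this thread.

```python
import re
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect (text, confidence, bbox) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # data for the word span currently open
        self._depth = 0       # nesting depth of tags inside that span

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            conf = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", title)
            bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
            self._current = {
                "parts": [],
                "conf": float(conf.group(1)) if conf else None,
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
            }
            self._depth = 0

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # The closing tag of the word span itself: store the word.
        text = "".join(self._current["parts"]).strip()
        self.words.append((text, self._current["conf"], self._current["bbox"]))
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["parts"].append(data)


def low_confidence_words(hocr_text, threshold=80):
    """Return words whose x_wconf is missing or below the threshold."""
    parser = WordCollector()
    parser.feed(hocr_text)
    parser.close()
    return [w for w in parser.words if w[1] is None or w[1] < threshold]
```

It could be called on the text of an hOCR file, for example `low_confidence_words(open("page.hocr", encoding="utf-8").read())`, where `page.hocr` is a placeholder name. The bounding boxes make it possible to look up each flagged word in the source image by hand.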

Ger Hobbelt

Mar 25, 2024, 7:50:32 AM

to tesseract-ocr

In your scenario, I would check the performance of both the modern LSTM engine (v4/v5) and the old "classic" v3 OCR engine in tesseract, just for completeness' sake. The first tests would be separate runs, so I'd be able to check the output quality of both in hOCR format. (Two separate runs so I don't have to bother with tesseract's internal heuristic to "pick the best one" and dump only that one: if I were you, I'd want to see both processes' performance and decide what to do after that.)
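
One way to compare the two runs, sketched here under assumptions, is to pull the word text out of each hOCR file and diff the two word sequences. The regular expression assumes the common hOCR layout in which every word is a `span` with class `ocrx_word` and words are not nested inside each other. The function names `hocr_words` and `disagreements` are made up for this example.

```python
import difflib
import html
import re

# Matches one word span; assumes word spans contain no nested <span>.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*>(.*?)</span>", re.S
)
TAG_RE = re.compile(r"<[^>]+>")


def hocr_words(hocr_text):
    """Return the recognized words of an hOCR document, in reading order."""
    return [
        html.unescape(TAG_RE.sub("", inner)).strip()
        for inner in WORD_RE.findall(hocr_text)
    ]


def disagreements(legacy_hocr, lstm_hocr):
    """Return (legacy_text, lstm_text) pairs where the two runs differ."""
    a = hocr_words(legacy_hocr)
    b = hocr_words(lstm_hocr)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

Words on which both engines agree are probably fine; the pairs this returns are the places worth checking against the image. An empty string on one side of a pair means that engine produced no word at that position.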

Postprocessing is akin to "fixing it in the mix": you only do that when all other options have been depleted.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6a64a68e-c3d5-4878-8c74-37be419c54d8n%40googlegroups.com.

Misti Hamon

Apr 29, 2024, 12:53:15 PM

to tesser...@googlegroups.com

Thank you for your reply, and please forgive my delay. It took me much longer to finish preprocessing my images than I anticipated (or, actually, than I was led to believe it would take, but that is probably because I'm working with a textbook-type layout and not a novel-type layout right now).

To confirm, you are suggesting one run with --oem 0 set, a second run with --oem 1 set, and then comparing the results, correct?
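
If that reading is right, the two runs could be scripted roughly as below. This is a sketch, not a tested recipe: it assumes `tesseract` is on the PATH, that the English language data is wanted, and that the installed traineddata contains a legacy model (the LSTM-only `tessdata_fast` and `tessdata_best` files do not, so `--oem 0` fails with them). The helper names and the output naming scheme are invented for the example.

```python
import subprocess
from pathlib import Path

# --oem 0 selects the legacy engine only, --oem 1 the LSTM engine only.
ENGINES = {"legacy": "0", "lstm": "1"}


def build_command(image, out_base, oem, lang="eng"):
    """Build the tesseract command line for one hOCR run."""
    # The trailing "hocr" is the config file that switches output to hOCR.
    return ["tesseract", str(image), str(out_base),
            "--oem", oem, "-l", lang, "hocr"]


def run_both(image, out_dir, lang="eng"):
    """Run both engines on one image; return {engine name: hOCR path}."""
    image = Path(image)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for name, oem in ENGINES.items():
        base = out_dir / f"{image.stem}.{name}"
        subprocess.run(build_command(image, base, oem, lang), check=True)
        # tesseract appends ".hocr" to the output base it was given.
        outputs[name] = Path(f"{base}.hocr")
    return outputs
```

Calling `run_both("page.png", "out")` (placeholder names) would leave `out/page.legacy.hocr` and `out/page.lstm.hocr` side by side, ready to be compared.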
