subscript and superscript recognition problems

An Keilha

Aug 31, 2020, 8:40:41 AM

to tesseract-ocr

A colleague and I are having problems with the recognition of subscript and superscript. We posted the problem on StackOverflow but didn't get any reply: https://stackoverflow.com/questions/63562290/tesseract-ocr-subscript-and-superscript-recognition-problems

I have problems with the general recognition of subscript and superscript in text fragments.

Example-image:

I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. All options were left at their default values except:

tessedit_create_hocr = 1 (to get result as HOCR)
hocr_font_info = 1 (to get additional font information such as font size)
hocr_char_boxes = 1 (to get character-based result)

The language was set to eng. The subscript/superscript was not recognized correctly with any of the page segmentation modes I tried: 3 (PSM_AUTO), 11 (PSM_SPARSE_TEXT), and 12 (PSM_SPARSE_TEXT_OSD).
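For anyone who wants to reproduce this setup, here is a minimal sketch of the equivalent command-line calls, wrapped in Python. It assumes the `tesseract` executable is on the PATH and that the tessdata_best models are already installed where Tesseract looks for them. The helper names (`build_command`, `run_all_modes`) and the output base names are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess

# Config variables from the post; everything else stays at Tesseract's defaults.
HOCR_OPTIONS = {
    "tessedit_create_hocr": 1,  # write the result as hOCR
    "hocr_font_info": 1,        # include font information
    "hocr_char_boxes": 1,       # include per-character boxes
}


def build_command(image_path, output_base, psm=3, lang="eng"):
    """Return the argv list for one Tesseract run (nothing is executed here)."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]
    for name, value in HOCR_OPTIONS.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_all_modes(image_path, modes=(3, 11, 12)):
    """Run Tesseract once per page segmentation mode tried in the post.

    Each run writes its own file, e.g. out_psm3.hocr (hypothetical base name).
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    for psm in modes:
        subprocess.run(build_command(image_path, f"out_psm{psm}", psm), check=True)
```

Calling `run_all_modes("example.png")` (a placeholder file name) would produce one hOCR file per mode, which makes it easier to compare the outputs side by side.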

In the output, the subscript/superscript fragments were all at least partly wrong:

"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"

Is there a way with Tesseract OCR to do either of the following?

optimize subscript/superscript handling
get information about recognized subscript/superscript characters (in the hOCR output, ideally for each character)
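To make the second point concrete: the hOCR output does not seem to mark subscript or superscript explicitly, but with hocr_char_boxes = 1 the per-character boxes can at least be inspected. The sketch below is not a Tesseract feature. It assumes that each character is written as a span of class `ocrx_cinfo` whose title starts with `x_bboxes x0 y0 x1 y1` (my understanding of the 4.1.x output) and that lines use the class `ocr_line`. The `SAMPLE` string is hand-written for illustration, not real Tesseract output, and the function names are hypothetical.

```python
from html.parser import HTMLParser
from statistics import median


class CharBoxParser(HTMLParser):
    """Collect (char, bbox) tuples per text line from hOCR character spans."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one list of (char, (x0, y0, x1, y1)) per line
        self._bbox = None  # bbox of the character span currently open

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        cls = attr.get("class", "")
        if cls == "ocr_line":  # assumption: only plain lines are grouped
            self.lines.append([])
        elif cls == "ocrx_cinfo":
            # Assumed title format: "x_bboxes 10 20 30 40; x_conf 98.5"
            for field in (attr.get("title") or "").split(";"):
                parts = field.split()
                if len(parts) >= 5 and parts[0] == "x_bboxes":
                    self._bbox = tuple(int(v) for v in parts[1:5])

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None  # character spans contain no nested spans

    def handle_data(self, data):
        if self._bbox is not None and data.strip() and self.lines:
            self.lines[-1].append((data.strip(), self._bbox))


def parse_chars(hocr_text):
    """Return the per-line character boxes found in an hOCR string."""
    parser = CharBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


def vertical_offsets(line):
    """Offset of each character's bottom edge from the line's median bottom
    edge, in units of the median character height (y grows downwards)."""
    if not line:
        return []
    ref_bottom = median(box[3] for _, box in line)
    ref_height = median(box[3] - box[1] for _, box in line) or 1
    return [(ch, (box[3] - ref_bottom) / ref_height) for ch, box in line]


# Hand-written sample, NOT real Tesseract output: "P0 is" with a lowered "0".
SAMPLE = """
<span class='ocr_line' title='bbox 0 10 60 48'>
 <span class='ocrx_word' title='bbox 0 10 28 48'>
  <span class='ocrx_cinfo' title='x_bboxes 0 10 18 40; x_conf 99'>P</span>
  <span class='ocrx_cinfo' title='x_bboxes 20 28 28 48; x_conf 80'>0</span>
 </span>
 <span class='ocrx_word' title='bbox 40 10 60 40'>
  <span class='ocrx_cinfo' title='x_bboxes 40 10 44 40; x_conf 97'>i</span>
  <span class='ocrx_cinfo' title='x_bboxes 46 20 60 40; x_conf 96'>s</span>
 </span>
</span>
"""
```

Positive offsets mean the character sits lower than most of its line, which makes it a subscript candidate, although ordinary descenders such as p or g will also show up that way. Negative offsets are superscript candidates. Any threshold would have to be tuned on real output, and this only helps when the character itself was recognized in the first place.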
