subscript and superscript recognition problems

An Keilha

Aug 31, 2020, 8:40:41 AM
to tesseract-ocr

A colleague and I are having problems with the recognition of subscript and superscript text. We posted our problem on StackOverflow but did not get any reply: https://stackoverflow.com/questions/63562290/tesseract-ocr-subscript-and-superscript-recognition-problems

"

I have problems with the general recognition of subscript and superscript in text fragments.

Example-image:

example.png

I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. All options were left at their default values except:

  • tessedit_create_hocr = 1 (to get the result as hOCR)
  • hocr_font_info = 1 (to get additional font information such as font size)
  • hocr_char_boxes = 1 (to get character-level results)

The language was set to eng. The subscript/superscript was not recognized correctly with any of the page segmentation modes I tried: 3 (PSM_AUTO_OSD), 11 (PSM_SPARSE_TEXT), and 12 (PSM_SPARSE_TEXT_OSD).
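
For anyone who wants to reproduce this, the setup above corresponds roughly to the following command line. This is only a minimal sketch: it assumes the tesseract binary is on the PATH and that the tessdata_best eng model is installed where Tesseract looks for it. The function name build_tesseract_command and the output base names are made up for illustration.

```python
import shlex


def build_tesseract_command(image_path, output_base, lang="eng", psm=3):
    """Return the argv list for a tesseract CLI run that writes hOCR.

    The three config variables are the non-default options listed above;
    everything else is left at Tesseract's defaults.
    """
    config_vars = {
        "tessedit_create_hocr": 1,  # write the result as hOCR
        "hocr_font_info": 1,        # include font information in the hOCR
        "hocr_char_boxes": 1,       # include per-character boxes
    }
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]
    for name, value in config_vars.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


if __name__ == "__main__":
    # Print one command per page segmentation mode that was tried.
    for psm in (3, 11, 12):
        cmd = build_tesseract_command("example.png", f"out_psm{psm}", psm=psm)
        print(shlex.join(cmd))
```

The commands are only printed here; passing the returned list to subprocess.run would actually execute them.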

In the output the sub/sup-fragments were all more or less wrong:

  • "Subtext<sub>Sub</sub>" is recognized as "Subtextsu,"
  • "Suptext<sup>Sub</sup>" is recognized as "Suptexts?"
  • "P<sub>0</sub>" is recognized as "Po"
  • "P<sub>100</sub>" is recognized as "P1go"
  • "a<sup>2</sup>+b<sup>2</sup>" is recognized as "a+b?"

Is there a way, when using Tesseract for OCR, to:

  1. improve the handling of subscript/superscript?
  2. get information about recognized subscript/superscript (in the hOCR output, ideally for each character)?
"
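
To make the second question more concrete: with hocr_char_boxes = 1 the hOCR output contains one ocrx_cinfo span per character, with its bounding box in the title attribute. One possible workaround is to guess sub/superscript from the box geometry alone. The following is a rough sketch of that idea, not something Tesseract provides: the function names, the 0.5 tolerance, and the sample hOCR fragment are all assumptions made for illustration, and the heuristic only makes sense for the characters of a single text line in which most characters are normal.

```python
import html
import re
from statistics import median

# Per-character spans as written when hocr_char_boxes=1, for example:
# <span class='ocrx_cinfo' title='x_bboxes 0 25 12 40; x_conf 99'>a</span>
CINFO_RE = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"]\s+title=['\"]x_bboxes (\d+) (\d+) (\d+) (\d+)"
    r"[^'\"]*['\"]>(.*?)</span>",
    re.S,
)


def char_boxes(hocr):
    """Return a list of (char, left, top, right, bottom) from an hOCR string."""
    boxes = []
    for m in CINFO_RE.finditer(hocr):
        left, top, right, bottom = (int(v) for v in m.group(1, 2, 3, 4))
        boxes.append((html.unescape(m.group(5)), left, top, right, bottom))
    return boxes


def classify(boxes, tolerance=0.5):
    """Label each character of ONE text line as 'normal', 'sub' or 'sup'.

    Heuristic (an assumption, not a Tesseract feature): a character whose
    vertical centre lies more than tolerance * median height away from the
    median centre is treated as shifted. Image y grows downwards, so a
    smaller centre means the character sits higher on the page.
    """
    if not boxes:
        return []
    centres = [(top + bottom) / 2 for _, _, top, _, bottom in boxes]
    heights = [bottom - top for _, _, top, _, bottom in boxes]
    limit = tolerance * median(heights)
    mid = median(centres)
    labels = []
    for (char, *_), centre in zip(boxes, centres):
        offset = centre - mid
        if offset < -limit:
            labels.append((char, "sup"))
        elif offset > limit:
            labels.append((char, "sub"))
        else:
            labels.append((char, "normal"))
    return labels


# Hand-written sample imitating the "a2+b2" case; the coordinates are invented.
SAMPLE = (
    "<span class='ocrx_cinfo' title='x_bboxes 0 25 12 40; x_conf 99'>a</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 13 10 20 22; x_conf 60'>2</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 22 24 36 38; x_conf 98'>+</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 38 15 50 40; x_conf 99'>b</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 51 10 58 22; x_conf 55'>2</span>"
)

if __name__ == "__main__":
    for char, label in classify(char_boxes(SAMPLE)):
        print(char, label)
```

This only helps if Tesseract segments the small characters at all, and punctuation such as apostrophes or commas would likely be flagged too, so it does not answer the first question.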