Page segmentation output ocr_float


James Owers

Jul 6, 2015, 6:52:14 AM
to tesser...@googlegroups.com

I'm trying to reproduce the results achieved at the ICDAR page segmentation competitions [1,2] with Tesseract. I'm struggling to get the tool to output the hOCR tags I expect for tables, figures, etc. [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to get the full extent of Tesseract's segmentation and labelling ability? (I'm less interested in the character recognition element.)

  1. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Book Recognition – HBR2013
  2. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
  3. Breuel (2010) The hOCR Embedded OCR Workflow and Output Format

I've cross-posted this from https://github.com/tesseract-ocr/tesseract/issues/42 and will update both with responses. Which is the default Q&A place?

Rick Leir

Jul 6, 2015, 2:59:15 PM
to tesser...@googlegroups.com
You can see how the hOCR file is built from lines like this:
api/baseapi.cpp:        hocr_str.add_str_int("\n    <p class='ocr_par' dir='ltr' id='par_",

Going out on a limb, I grepped the tree for ocr_float, and got no hits. A closer look at the code might turn up something, so have a look.

What I see in api/baseapi.cpp is:
'ocr_page'
'ocr_carea'
'ocr_par'
'ocr_line'
'ocrx_word'
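
If you want to confirm which of these classes actually appear in a given hOCR output file, rather than in the source, something like the following could work. It is only a rough sketch using Python's standard html.parser; the helper names (HocrClassCollector, hocr_classes) and the SAMPLE fragment are made up for illustration and are not real Tesseract output.

```python
from html.parser import HTMLParser


class HocrClassCollector(HTMLParser):
    """Collect every name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def hocr_classes(hocr_text):
    """Return the set of class names used anywhere in the hOCR markup."""
    collector = HocrClassCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector.classes


# Hypothetical fragment written by hand for this example; it only mimics
# the class names listed above and is not genuine tesseract output.
SAMPLE = (
    "<div class='ocr_page' id='page_1'>"
    "<div class='ocr_carea' id='block_1_1'>"
    "<p class='ocr_par' dir='ltr' id='par_1_1'>"
    "<span class='ocr_line' id='line_1_1'>"
    "<span class='ocrx_word' id='word_1_1'>text</span>"
    "</span></p></div></div>"
)

found = hocr_classes(SAMPLE)
print(sorted(found))
print("ocr_float" in found)
```

To run it against a real file, read the .hocr output into a string and pass it to hocr_classes in place of SAMPLE.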

You can also look in api/renderer.cpp :

bool TessHOcrRenderer::BeginDocumentHandler() {
..
      "  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"
      " ocr_line ocrx_word");

James Owers

Jul 20, 2015, 8:39:19 AM
to tesser...@googlegroups.com
Thank you Rick. A concise answer was given on GitHub recently:

jimregan commented 2 days ago

This issue is currently the top search result for 'ocr_float'; it lacks a simple summary: Tesseract (currently) does not support ocr_float.
