Page segmentation output ocr_float

James Owers

Jul 6, 2015, 6:52:14 AM

to tesser...@googlegroups.com

I'm trying to reproduce the results achieved at the ICDAR page segmentation competitions [1,2] with Tesseract. I'm struggling to get the tool to output the hOCR tags that I expect for tables, figures and so on [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to get the full extent of Tesseract's segmentation and labelling ability? (I'm not as interested in the character recognition element.)

[1] Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Book Recognition – HBR2013
[2] Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
[3] Breuel (2010) The hOCR Embedded OCR Workflow and Output Format

I've cross-posted this from https://github.com/tesseract-ocr/tesseract/issues/42 and will update both threads with any responses. Which is the preferred place for Q&A?

Rick Leir

Jul 6, 2015, 2:59:15 PM

to tesser...@googlegroups.com

You can see how the hOCR file is built from lines like this one:
api/baseapi.cpp: hocr_str.add_str_int("\n <p class='ocr_par' dir='ltr' id='par_",

Going out on a limb, I grepped the source tree for ocr_float and got no hits. A closer look at the code might turn something up, so it is worth checking yourself.

What I see in api/baseapi.cpp is:
'ocr_page'
'ocr_carea'
'ocr_par'
'ocr_line'
'ocrx_word'

You can also look in api/renderer.cpp:

bool TessHOcrRenderer::BeginDocumentHandler() {
..
" <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"
" ocr_line ocrx_word");
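
To check this against real output rather than the source, one option is to count the hOCR classes that actually appear in a generated file. The following is a minimal Python sketch, not part of Tesseract: the function name `hocr_class_counts` and the sample fragment are hypothetical, and the set of advertised classes is simply copied from the `ocr-capabilities` string quoted above.

```python
import re
from collections import Counter

# Element classes advertised in the 'ocr-capabilities' meta tag quoted
# above from api/renderer.cpp (as of this 2015 thread).
ADVERTISED = {"ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word"}

# Matches class='...' or class="..." attributes.
CLASS_ATTR = re.compile(r"""class\s*=\s*(['"])(.*?)\1""")


def hocr_class_counts(hocr_text):
    """Count how often each ocr_* / ocrx_* class appears in an hOCR string."""
    counts = Counter()
    for match in CLASS_ATTR.finditer(hocr_text):
        for name in match.group(2).split():
            if name.startswith(("ocr_", "ocrx_")):
                counts[name] += 1
    return counts


# Hypothetical, hand-written fragment in the style of Tesseract's output;
# in practice you would read the .hocr file that tesseract produced.
SAMPLE = """
<div class='ocr_page' id='page_1'>
 <div class='ocr_carea' id='block_1_1'>
  <p class='ocr_par' dir='ltr' id='par_1_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Hello</span>
    <span class='ocrx_word' id='word_1_2'>world</span>
   </span>
  </p>
 </div>
</div>
"""

counts = hocr_class_counts(SAMPLE)
# Classes present in the output that the renderer does not advertise.
unexpected = sorted(set(counts) - ADVERTISED)

for name, n in sorted(counts.items()):
    print(name, n)
print("unexpected classes:", unexpected)
```

Running the same function over a real hOCR file should show whether any class outside the advertised five (for example ocr_float) is ever emitted.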

James Owers

Jul 20, 2015, 8:39:19 AM

to tesser...@googlegroups.com

Thank you Rick. A concise answer was given on GitHub recently:

jimregan commented 2 days ago

This issue is currently the top search result for 'ocr_float'; it lacks a simple summary: Tesseract (currently) does not support ocr_float.
