Tesseract hOCR to produce xml, not (x)html, or a tool to simulate this, or a tool to collect the div (etc) attributes


Kim Rönnberg

Mar 26, 2016, 6:56:17 AM
to tesseract-ocr
Is there a way to make Tesseract produce "real" XML instead of the (X)HTML that hOCR produces, i.e. to create XML tags like <ocr_page id='page_1' title='...'> instead of "<div class='ocr_page' id='page_1'...", <ocr_area id='...' title='...'> instead of "<div class='ocr_carea' id='block_1_1'..." etc.?

Or is there a ready-made tool that converts the (X)HTML that hOCR produces into a format that is easier to parse as XML? Even better would be something that gives me the divs, spans and ps grouped per word, line, area and page, ready to load into a (PHP) array for insertion into a database, using the data hOCR produces now.

For example, one row per word with the fields "file_name", "page_nr", "area_id", "line_nr", "word_nr", "word bbox x1 y1 x2 y2", "the word value". I realise this means a lot of rows (one per word in a document), but this is what I need.

I have spent some days on this, trying to find something that works with PHP, but have not managed to find anything.

Regards

Kim Rönnberg

jsbien

Mar 26, 2016, 11:26:11 AM
to tesseract-ocr

There are some tools to convert hOCR to XCES (XML Corpus Encoding Format):

 https://bitbucket.org/jwilk/marasca-wbl/

Regards

Janusz

Tom Morris

Mar 26, 2016, 11:30:51 AM
to tesseract-ocr
On Saturday, March 26, 2016 at 6:56:17 AM UTC-4, Kim Rönnberg wrote:
> Is there a way to make Tesseract produce "real" xml instead of the (x)html hOCR produces, ie. to create xml tags like <ocr_page id='page_1' title='...'> instead of "<div class='ocr_page' id='page_1'...", <ocr_area id='...' title='...'> instead of "<div class='ocr_carea' id='block_1_1'..." etc.?
>
> Or is there somewhere a "ready" something with which the (x)html hOCR produces can be converted to a more "easily" xml parseable format, or, even better, a something that would give me the div's, span's and p's grouped per word, line, area and page readily insertable to a (php) array for inserting into a database, of the data format the hOCR produces now?

PHP? Ugh! But that aside, what specific problem is it having parsing the output? XHTML *is* "real" XML.
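
To illustrate that point, here is a minimal Python sketch (not PHP) that loads hOCR with a stock XML parser and flattens it into one row per word, in roughly the shape asked for. The sample markup is hand-written for illustration, not real Tesseract output. The class names ocr_line and ocrx_word and the id patterns are assumptions about what Tesseract emits, and word_rows is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; real hOCR files carry more
# attributes and usually a DOCTYPE. Class names and ids are assumptions.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class='ocr_page' id='page_1' title='image "scan.png"; bbox 0 0 600 800; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 10 10 300 60">
    <p class='ocr_par' id='par_1_1' title="bbox 10 10 300 60">
     <span class='ocr_line' id='line_1_1' title="bbox 10 10 300 60">
      <span class='ocrx_word' id='word_1_1' title='bbox 10 10 100 60; x_wconf 90'>Hello</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 110 10 300 60; x_wconf 85'>world</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>"""

BBOX = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def word_rows(hocr_text, file_name):
    """Return one tuple per word:
    (file_name, page_id, area_id, line_id, word_id, bbox, text)."""
    root = ET.fromstring(hocr_text)
    rows = []

    def walk(elem, ctx):
        cls = elem.get("class", "")
        if cls in ("ocr_page", "ocr_carea", "ocr_line"):
            # Remember the id of the enclosing page / area / line.
            ctx = dict(ctx, **{cls: elem.get("id")})
        elif cls == "ocrx_word":
            m = BBOX.search(elem.get("title", ""))
            bbox = tuple(int(g) for g in m.groups()) if m else None
            text = "".join(elem.itertext()).strip()
            rows.append((file_name, ctx.get("ocr_page"), ctx.get("ocr_carea"),
                         ctx.get("ocr_line"), elem.get("id"), bbox, text))
            return
        for child in elem:
            walk(child, ctx)

    walk(root, {})
    return rows


if __name__ == "__main__":
    for row in word_rows(SAMPLE_HOCR, "scan.png"):
        print(row)
```

Each tuple maps directly onto a database row. The same traversal should be possible with any XML parser that exposes attributes, including the ones available in PHP.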

> Like "file_name", "page_nr", "area_id", "line_nr", "word_nr", "word bbox x1 y1 x2 y2", "the word value", for each word? I realise this means a lot of rows (one per word in a document), but this is something I need.

You might be interested in the TSV format that was added recently (but it's not available in a release yet):

Tom

Helmut Wollmersdorfer

Mar 26, 2016, 2:36:29 PM
to tesseract-ocr
I usually use Perl for such tasks:

use strict;
use warnings;
use Mojo::DOM;

# Path to the hOCR file, taken from the command line
my $hocr_file = shift @ARGV or die "usage: $0 file.hocr\n";
open(my $hocr_fh, "<:encoding(UTF-8)", $hocr_file) or die "cannot open $hocr_file: $!";

# Slurp the whole file into one string
my $html = do { local $/; <$hocr_fh> };
close($hocr_fh);

my $dom = Mojo::DOM->new($html);

my $ocr_page = {};
# <div class='ocr_page' id='page_1' title='image "isisvonoken00oken_0153.png"; bbox 0 0 2321 2817; ppageno 0'>
for my $e ($dom->find('div.ocr_page')->each) {
    my $title = $e->{title};
    print 'page title: ', $title, "\n";
    # Extract the page bounding box from the title attribute
    if ($title =~ m/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/) {
        $ocr_page->{x1} = $1;
        $ocr_page->{y1} = $2;
        $ocr_page->{x2} = $3;
        $ocr_page->{y2} = $4;
    }
}

Mojo::DOM is an HTML/XML DOM parser that lets you navigate by CSS selectors (like jQuery).
Of course, there are dozens of other XML parsers available in Perl.

I'm sure there are similar parsers usable from PHP.
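
The title-attribute handling is not specific to Perl either. As a rough Python sketch, the title can be split into all of its properties rather than only bbox. This assumes the properties are separated by semicolons, as in the ocr_page example in the comment above. parse_title is a hypothetical helper name, and the sketch would mishandle an image file name that itself contains a semicolon or runs of whitespace.

```python
def parse_title(title):
    """Split an hOCR title attribute into a dict of properties.

    Assumes 'name value ...' pairs separated by semicolons. The bbox
    property becomes a tuple of four ints; everything else stays a string.
    """
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue  # skip empty segments, e.g. after a trailing semicolon
        name, values = tokens[0], tokens[1:]
        if name == "bbox" and len(values) == 4:
            props[name] = tuple(int(v) for v in values)
        else:
            # Drop the double quotes around values such as the image name.
            props[name] = " ".join(values).strip('"')
    return props


if __name__ == "__main__":
    title = 'image "isisvonoken00oken_0153.png"; bbox 0 0 2321 2817; ppageno 0'
    print(parse_title(title))
```

Keeping every property in one dict means additional fields can be stored next to the bounding box without writing another regular expression.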

Helmut Wollmersdorfer