I usually use Perl for such tasks:
    use strict;
    use warnings;
    use Mojo::DOM;

    # Assumption: the hOCR file name is passed as the first argument.
    my $hocr_file = shift @ARGV or die "usage: $0 file.hocr\n";

    open(my $hocr_fh, "<:encoding(UTF-8)", $hocr_file)
        or die "cannot open $hocr_file: $!";
    my $html = do { local $/; <$hocr_fh> };    # slurp the whole file
    close $hocr_fh;

    my $dom = Mojo::DOM->new($html);
    my $ocr_page = {};

    # <div class='ocr_page' id='page_1' title='image "isisvonoken00oken_0153.png"; bbox 0 0 2321 2817; ppageno 0'>
    for my $e ($dom->find('div.ocr_page')->each) {
        my $title = $e->{'title'};
        print 'page title: ', $title, "\n";
        # pull the page bounding box out of the title attribute
        if ($title =~ m/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/) {
            $ocr_page->{x1} = $1;
            $ocr_page->{y1} = $2;
            $ocr_page->{x2} = $3;
            $ocr_page->{y2} = $4;
        }
    }
Mojo::DOM is an HTML/XML DOM parser that lets you navigate the document with CSS selectors (like jQuery).
Of course, there are dozens of other XML parsers available in Perl.
I'm sure there are similar parsers for PHP.
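
The title attribute in the sample div carries more than the bbox (it also has image and ppageno). If you need those as well, a minimal sketch in plain Perl could split the title into its properties first. Note that parse_hocr_title is a hypothetical helper name, not part of Mojo::DOM, and the sketch assumes that no property value contains a semicolon:

```perl
use strict;
use warnings;

# Hypothetical helper (not from Mojo::DOM or any hOCR library):
# split an hOCR title attribute into a hash of property name => value string.
# Assumption: properties are separated by ';' and no value contains a ';'.
sub parse_hocr_title {
    my ($title) = @_;
    my %props;
    for my $part (split /\s*;\s*/, $title) {
        # each part has the form "name value ..."
        next unless $part =~ /^\s*(\S+)\s+(.*?)\s*$/;
        $props{$1} = $2;
    }
    return \%props;
}

# Title taken from the sample div above.
my $title = 'image "isisvonoken00oken_0153.png"; bbox 0 0 2321 2817; ppageno 0';
my $props = parse_hocr_title($title);

# The bbox value is four whitespace-separated integers: x1 y1 x2 y2.
my ($x1, $y1, $x2, $y2) = split ' ', $props->{bbox};
print "width: ", $x2 - $x1, ", height: ", $y2 - $y1, "\n";
```

For the sample title this prints "width: 2321, height: 2817". The image value keeps its surrounding double quotes, so strip them if you need the bare file name.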
Helmut Wollmersdorfer