FineReader XML to hOCR converter

cma...@googlemail.com

Dec 2, 2007, 12:27:39 PM
to ocropus
Hi Thomas,
I've uploaded a prototype of an Abbyy FineReader XML to hOCR converter
to the file section. It's tested with the output of the FineReader OCR
engine version 8. The official schema is available at [1]. It uses
the SAX2 API since these files can get very large (I have files up to
400 MB for a single book).
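
A minimal sketch of the streaming approach in modern Python, not the uploaded converter itself. It assumes that each `line` element carries its bounding box in `l`, `t`, `r`, `b` attributes and that the recognized characters appear as text inside it; the class name `LineHandler` and the function `convert` are hypothetical. Only the line currently being read is held in memory, which is what makes SAX suitable for very large files.

```python
import xml.sax
from xml.sax.saxutils import escape


class LineHandler(xml.sax.ContentHandler):
    """Write one hOCR ocr_line span per <line> element while streaming."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.bbox = None   # bbox string of the <line> currently open, if any
        self.chars = []    # text fragments collected inside that line

    def startElement(self, name, attrs):
        if name == "line":
            # Assumed attribute names: l, t, r, b (left, top, right, bottom).
            self.bbox = " ".join(attrs.get(k, "0") for k in ("l", "t", "r", "b"))
            self.chars = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so collect and join later.
        if self.bbox is not None:
            self.chars.append(content)

    def endElement(self, name):
        if name == "line" and self.bbox is not None:
            text = escape("".join(self.chars).strip())
            self.out.write('<span class="ocr_line" title="bbox %s">%s</span>\n'
                           % (self.bbox, text))
            self.bbox = None


def convert(source, out):
    """Parse `source` (file name or file object) and write hOCR lines to `out`."""
    xml.sax.parse(source, LineHandler(out))
```

Calling `convert("book.xml", sys.stdout)` would then emit one span per recognized line without ever building a tree; the file name is only an example.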

There are several issues I'm aware of:
- The language mapping is far from complete
- Font settings are ignored
- Columns are not supported
- The output lacks the document type declaration and namespace
definitions (the standard Python class "XMLGenerator" is not very useful)
- The code is not very well documented

Some questions remain:
- I don't know Python very well: Can you recommend a Python library
for event-based XML creation?
- Am I right that this fragment of a table would be valid hOCR too
(the <span>s are missing)?
<table><tr>
<th class="ocr_line" title="bbox 734 1811 1459 1891">Ungekeimte
Erbsen.</th>
<th class="ocr_line" title="bbox 1556 1806 1941 1871">1. Periode.</th>
<th class="ocr_line" title="bbox 2033 1806 2414 1870">2. Periode.</th>
</tr><tr>
<td class="ocr_line" title="bbox 473 1909 1166 1986">Fett
2,27</td>
<td class="ocr_line" title="bbox 1704 1912 1839 1987">2,32</td>
<td class="ocr_line" title="bbox 2148 1912 2288 1985">2,20</td>
</tr></table>


Please let me know what you think.

Cheers,
Christian

[1] http://www.abbyy.com/FineReader_xml/FineReader8-schema-v2.xml

Thomas Breuel

Dec 3, 2007, 9:05:08 AM
to ocr...@googlegroups.com
Hi,

> I've uploaded a prototype of an Abbyy FineReader XML to hOCR converter
> to the file section. It's tested with the output of the FineReader OCR
> engine version 8. The official schema is available at [1]. It uses
> the SAX2 API since these files can get very large (I have files up to
> 400 MB for a single book).

That's great news. Yes, few applications need SAX these days, but Abbyy XML may be one of them...

> There are several issues I'm aware of:
> - The language mapping is far from complete
> - Font settings are ignored
> - Columns are not supported

We should be able to get those from ocrx_block elements.
 

> - The output lacks the document type declaration and namespace
> definitions (the standard Python class "XMLGenerator" is not very useful)

You might be able to just insert those directly into the output.
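
One way to read this suggestion, as a sketch in modern Python: let `XMLGenerator` write the XML declaration, write the DOCTYPE string straight to the same stream, and pass the namespace as an ordinary `xmlns` attribute. The choice of the XHTML 1.0 Transitional DOCTYPE and the helper name `start_hocr` are assumptions for illustration.

```python
import io
from xml.sax.saxutils import XMLGenerator

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumed DOCTYPE; substitute whichever one the output should declare.
DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n')


def start_hocr(out):
    """Write the XML declaration, DOCTYPE and <html> start tag to `out`."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()   # emits the <?xml ...?> declaration
    out.write(DOCTYPE)    # XMLGenerator has no DOCTYPE method, so write it raw
    gen.startElement("html", {"xmlns": XHTML_NS})  # namespace as a plain attribute
    return gen


buf = io.StringIO()
gen = start_hocr(buf)
gen.endElement("html")
gen.endDocument()
print(buf.getvalue())
```

The generator can then be used as before for the rest of the document, since it writes to the same stream.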

> Some questions remain:
> - I don't know Python very well: Can you recommend a Python library
> for event-based XML creation?

I think people generally just use "print" for that.
 

> - Am I right that this fragment of a table would be valid hOCR too
> (the <span>s are missing)?
> <table><tr>
> <th class="ocr_line" title="bbox 734 1811 1459 1891">Ungekeimte
> Erbsen.</th>
> <th class="ocr_line" title="bbox 1556 1806 1941 1871">1. Periode.</th>
> <th class="ocr_line" title="bbox 2033 1806 2414 1870">2. Periode.</th>
> </tr><tr>
> <td class="ocr_line" title="bbox 473 1909 1166 1986">Fett
> 2,27</td>
> <td class="ocr_line" title="bbox 1704 1912 1839 1987">2,32</td>
> <td class="ocr_line" title="bbox 2148 1912 2288 1985">2,20</td>
> </tr></table>

That looks like a good way of representing that content. Eventually, it might be a good idea to actually specify elements for tables in hOCR.

Cheers,
Thomas.

cma...@googlemail.com

Dec 3, 2007, 12:58:38 PM
to ocropus
Hi Thomas,
> > I've uploaded a prototype of an Abbyy FineReader XML to hOCR converter
> > to the file section. It's tested with the output of the FineReader OCR
> > engine version 8. The official schema is available at [1]. It uses
> > the SAX2 API since these files can get very large (I have files up to
> > 400 MB for a single book).
>
> That's great news. Yes, there are few applications that require it these
> days, but Abbyy XML may be one of them...
I have to admit that I'm a big fan of stream/event-based XML APIs,
since they scale much better than a tree-based approach.

> > There are several issues I'm aware of:
> > - The language mapping is far from complete
> > - Font settings are ignored
> > - Columns are not supported
>
> We should be able to get those from ocrx_block elements.
Yes, I'm planning to add a more hOCR-like object model to the script.
This should make the needed content analysis easier.

> > - The output lacks the document type declaration and namespace
> > definitions (the standard Python class "XMLGenerator" is not very useful)
>
> You might be able to just insert those directly into the output.
I'm looking for some kind of XML API; see below.

> > - I don't know Python very well: Can you recommend a Python library
> > for event-based XML creation?
>
> I think people generally just use "print" for that.
I know, but I'd like to have an API for that. This generally reduces the
number of possible XML errors (namespaces, well-formedness), since
others have already done this work for you. I'm also asking this question
to avoid external dependencies.

Maybe I should just switch to the DOM API. This way it's possible to
move the XML serialisation out of the parsing event loop into an object
representation of hOCR.
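
A rough sketch of what such an object representation could look like. It uses `xml.etree.ElementTree` from the standard library rather than the DOM API mentioned above, and the classes `Line` and `Page` are hypothetical, not part of the converter. The parsing loop would only fill these objects; serialisation happens in `to_element`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import xml.etree.ElementTree as ET


@dataclass
class Line:
    text: str
    bbox: Tuple[int, int, int, int]

    def to_element(self):
        # One ocr_line span carrying its bounding box in the title attribute.
        el = ET.Element("span", {"class": "ocr_line",
                                 "title": "bbox %d %d %d %d" % self.bbox})
        el.text = self.text
        return el


@dataclass
class Page:
    lines: List[Line] = field(default_factory=list)

    def to_element(self):
        el = ET.Element("div", {"class": "ocr_page"})
        for line in self.lines:
            el.append(line.to_element())
        return el


page = Page()
page.lines.append(Line("halten sich nun:", (219, 3491, 817, 3558)))
print(ET.tostring(page.to_element(), encoding="unicode"))
```

To keep memory bounded on very large inputs, one option would be to build and serialise one `Page` at a time and then discard it.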

> > - Am I right that this fragment of a table would be valid hOCR too
> > (the <span>s are missing)?
> > <table><tr>
> > <th class="ocr_line" title="bbox 734 1811 1459 1891">Ungekeimte
> > Erbsen.</th>
> > <th class="ocr_line" title="bbox 1556 1806 1941 1871">1. Periode.</th>
> > <th class="ocr_line" title="bbox 2033 1806 2414 1870">2. Periode.</th>
> > </tr><tr>
> > <td class="ocr_line" title="bbox 473 1909 1166 1986">Fett
> > 2,27</td>
> > <td class="ocr_line" title="bbox 1704 1912 1839 1987">2,32</td>
> > <td class="ocr_line" title="bbox 2148 1912 2288 1985">2,20</td>
> > </tr></table>
>
> That looks like a good way of representing that content. Eventually, it
> might be a good idea to actually specify elements for tables in hOCR.
It's just a proposal; the current version of the converter creates a
construct like "<td><span class="ocr_line" title="bbox 1 2 3 4">cell
content</span></td>"


I've got yet another hOCR question:
Do you consider this (partly handcrafted) fragment valid hOCR
(embedded <span> with style definition)?

<span class="ocr_line" title="bbox 219 3491 817 3558; cuts 47 39 24 28
35 41 47 33 26 34 43 42 43 45 55">halten <span style="font-style:
italic">sich</span> nun:</span>
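
For readers unfamiliar with the title syntax used in this fragment: it packs several properties into one attribute, separated by semicolons, each being a name followed by space-separated values. A minimal sketch of reading it back, with the hypothetical helper `parse_title`, assuming no property value contains a semicolon itself:

```python
def parse_title(title):
    """Split an hOCR title attribute into {property name: [values as strings]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()   # also absorbs line breaks inside the attribute
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


props = parse_title("bbox 219 3491 817 3558; cuts 47 39 24 28")
bbox = [int(v) for v in props["bbox"]]
print(bbox, props["cuts"])
```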

Cheers,
Christian

Thomas Breuel

Dec 4, 2007, 6:14:35 PM
to ocr...@googlegroups.com

> It's just a proposal; the current version of the converter creates a
> construct like "<td><span class="ocr_line" title="bbox 1 2 3 4">cell
> content</span></td>"

That's fine. You can put a class="ocr_..." attribute on almost any regular HTML tag in order to encode OCR information. However, HTML tables may be used for many purposes, so "<td class="ocr_line" title="bbox 1704 1912 1839 1987">2,32</td>" could be a line in a table, but it could also just be a text line that happens to be rendered in a table.

If you want to indicate that something is indeed a table, you can write something like:

<table class="ocr_table"> ... </table>

or

<span class="ocr_table"><table> ... </table></span>
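
As an illustration of the first variant combined with the `<td><span class="ocr_line">` construct the converter currently produces, here is a small sketch using `xml.etree.ElementTree`. The function name `build_table` and the input layout (rows of `(text, bbox)` pairs) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_table(rows):
    """rows: list of rows; each cell is (text, (x0, y0, x1, y1))."""
    table = ET.Element("table", {"class": "ocr_table"})
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for text, bbox in row:
            td = ET.SubElement(tr, "td")
            # The OCR information sits on the span, not on the layout cell.
            span = ET.SubElement(td, "span", {
                "class": "ocr_line",
                "title": "bbox %d %d %d %d" % tuple(bbox),
            })
            span.text = text
    return table


rows = [[("2,32", (1704, 1912, 1839, 1987)), ("2,20", (2148, 1912, 2288, 1985))]]
html = ET.tostring(build_table(rows), encoding="unicode")
print(html)
```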


> I've got yet another hOCR question:
> Do you consider this (partly handcrafted) fragment valid hOCR
> (embedded <span> with style definition)?
>
> <span class="ocr_line" title="bbox 219 3491 817 3558; cuts 47 39 24 28
> 35 41 47 33 26 34 43 42 43 45 55">halten <span style="font-style:
> italic">sich</span> nun:</span>

It's valid, but if you put the style on something that isn't an hOCR element, then it only affects rendering. You probably want a "class=ocr_cinfo" in there.

I've tried to clarify this in Section 9.

Cheers,
Thomas.
