hOCR editor not Tesseract compatible and not word compatible

Havjers Havjers

May 22, 2011, 8:21:56 AM
to moz-hocr-edit
Hello.

Thanks for your software, which fills a large gap in the field of
proofreading tools.

However, I would still like to point out some issues I have found when
working with hOCR files as generated by Tesseract.

First, the image file name is written in the hOCR file between double
quotes ("file.ext"). This makes your hOCR editor build the URL
file://path/"file.ext", so the images are not loaded.

Second, it seems your software does not support word-level markup, so
with Tesseract hOCR it simply displays many <span> tags with word
coordinates inside the line text box. Is there any chance you could
modify your software so that it can load word-level hOCR files and save
at least in line mode?

Thanks.

Jim Garrison

May 23, 2011, 2:01:51 PM
to Havjers Havjers, moz-ho...@googlegroups.com
Thank you for the comments.

I haven't worked significantly on the editor since Tesseract 3.00 was
released, which was when they added support for hOCR. I managed to
build it and create a test document, which is now located in the repository.

I'm not sure why they chose to put the image filename between quotes.
In fact, if you use as input to tesseract an image with a file name that
contains quotes or spaces, it doesn't escape them in any way, making the
quotes quite useless. They're also not mentioned in the hOCR spec, as
far as I can tell.

Nevertheless, I made it so such quotes will be removed if they exist.
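
The quote handling described above can be sketched as follows. This is a minimal Python illustration, not the editor's actual code (moz-hocr-edit runs in the browser). The helper names `parse_title` and `image_name` are hypothetical, and the sketch assumes that properties in the title attribute are separated by semicolons, so a file name containing a semicolon would be split incorrectly.

```python
def parse_title(title):
    """Split an hOCR title attribute into a {property: value} dict.

    Assumption: properties are separated by ';' and the property name is
    the first whitespace-delimited token of each part.
    """
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def image_name(title):
    """Return the image file name, with one pair of enclosing quotes removed."""
    value = parse_title(title).get("image")
    if value is None:
        return None
    # Strip the quotes only when they enclose the whole value.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        value = value[1:-1]
    return value


print(image_name('image "image.bmp"; bbox 0 0 3420 4836'))
print(image_name("image image.bmp; bbox 0 0 3420 4836"))
```

Both calls yield the bare file name, so the same loading code works whether or not the OCR engine added quotes.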

I will also look into removing the word-level hOCR data when I get some
free time. I think that really is the best way to solve this problem.
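
One way to drop the word-level data is to keep each line's title (with its bounding box) and join the text of the word spans inside it. The following is a rough Python sketch of that idea, not the editor's implementation. It assumes lines are marked with class `ocr_line` and that words are nested `<span>` elements; `LineFlattener` and `flatten_lines` are made-up names.

```python
from html.parser import HTMLParser


class LineFlattener(HTMLParser):
    """Collect (title, text) for every ocr_line span, ignoring nested word spans."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._depth = 0      # span nesting depth inside the current line (0 = outside)
        self._title = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            # A nested span (for example a word): just track the nesting.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._title = attrs.get("title") or ""
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left between the word spans.
            text = " ".join("".join(self._chunks).split())
            self.lines.append((self._title, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def flatten_lines(hocr):
    parser = LineFlattener()
    parser.feed(hocr)
    parser.close()
    return parser.lines


sample = (
    '<span class="ocr_line" title="bbox 10 20 300 60">'
    '<span class="ocr_word" title="bbox 10 20 100 60">Hello</span> '
    '<span class="ocr_word" title="bbox 120 20 300 60">world</span>'
    "</span>"
)
print(flatten_lines(sample))
```

The word coordinates are lost in this approach, which is the trade-off raised later in the thread.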

Cheers,
Jim

Havjers Havjers

May 25, 2011, 9:19:32 PM
to Jim Garrison
Hello.

Thanks for the prerelease version, and sorry about the delay. I have
had some time to test it under the latest Nightly build for Win32 with
Tesseract 3.01 output, and it seems to work OK, except that when
saving to hOCR the image file name is written with HTML escape
sequences for the double quotes:

<div title="image &quot;image.bmp&quot;; bbox 0 0 3420 4836"
id="page_1" class="ocr_page">

I really can't tell whether this should be considered an error or it
is just fine. The file name is not to be rendered in the "web page",
so there should be no need to escape it, right? Apart from that, it
seems your code handles it correctly when loading the modified file
again.

I will take a look tomorrow to work out how your checkboxes behave.
For now I presume Tesseract output will never have any paragraph
tags.

Finally, it would be nice if you could add a control at the top of the
page (like the page dropdown menu) to set a zoom level for the
displayed images, as I mainly work in DjVu format and do OCR on
high-resolution b/w images, which get too large on screen. It would be
just as nice if you could add Unicode support, though I presume this
will be slightly harder. The problem with ANSI is that it requires
further conversion to Unicode (is this the standard encoding defined
for hOCR?), and it cannot handle several languages at the same time
(e.g. Spanish + Russian), as their ANSI codes for language-specific
characters are not compatible.

Thanks for all your time and for this little piece of wonderful software.

Jim Garrison

May 27, 2011, 1:28:32 AM
to jsb...@mimuw.edu.pl, Havjers Havjers, Jakub Wilk, moz-hocr-edit, ho...@googlegroups.com
On 05/26/2011 10:04 PM, Janusz S. Bień wrote:
> On Thu, 26 May 2011 Havjers Havjers <hav...@gmail.com> wrote:
>
> [...]

>
>> I mainly work in DjVu format
>
> What about adapting moz-hocr-edit to work directly with DjVu?

inline djvu is not natively supported by mozilla/Firefox (or any other
web browser), so this would be very nontrivial to implement.

I think the remainder of this message (below) belongs on the hOCR
discussion list, which I have included.

> To be more precise, what about extending hOCR format to allow links to
> DjVu page fragments instead of including the images? Such links look like this:
>
> http://poliqarp.wbl.klf.uw.edu.pl/extra/linde/index.djvu?djvuopts=&zoom=154&showposition=0.5,0.26&highlight=1190,1840,1016,50&page=p0155.djvu
> http://poliqarp.wbl.klf.uw.edu.pl/extra/linde/index.djvu?djvuopts=&zoom=154&showposition=0.5,0.26&highlight=1183,1791,1025,61&page=p0155.djvu
>
> Of course the common part should be stored only once.
>
> Then the editor may just embed the DjVu fragment in the displayed page
> (the highlight color may be configurable).
>
> If the change to hOCR format is agreed, then I hope Jakub Wilk would
> be willing to extend appropriately his djvu2hocr program bundled with
> ocrodjvu:
>
> http://jwilk.net/software/ocrodjvu
>
> Best regards
>
> Janusz
>
> P.S. Perhaps those links
>
> http://bc.klf.uw.edu.pl/177/
> http://poliqarp.wbl.klf.uw.edu.pl/
>
> may be of some interest to you.
>

Janusz S. Bień

May 27, 2011, 1:39:41 AM
to Jim Garrison, Havjers Havjers, Jakub Wilk, moz-hocr-edit, ho...@googlegroups.com
On Thu, 26 May 2011 Jim Garrison <j...@garrison.cc> wrote:

> On 05/26/2011 10:04 PM, Janusz S. Bień wrote:
>> On Thu, 26 May 2011 Havjers Havjers <hav...@gmail.com> wrote:
>>
>> [...]
>>
>>> I mainly work in DjVu format
>>
>> What about adapting moz-hocr-edit to work directly with DjVu?
>
> inline djvu is not natively supported by mozilla/Firefox (or any other
> web browser), so this would be very nontrivial to implement.

I don't mean inline djvu, but embedding.

We embed DjVu e.g. on the welcome page of our digital library

http://bc.klf.uw.edu.pl/

and nobody has ever complained. Please check for yourself.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Jim Garrison

May 27, 2011, 1:40:30 AM
to Havjers Havjers, moz-hocr-edit
On 05/25/2011 06:19 PM, Havjers Havjers wrote:
> Hello.
>
> Thanks for the prerelease version, and sorry about the delay. I have
> had some time to test it under the latest Nightly build for Win32 with
> Tesseract 3.01 output, and it seems to work OK, except that when
> saving to hOCR the image file name is written with HTML escape
> sequences for the double quotes:
>
> <div title="image &quot;image.bmp&quot;; bbox 0 0 3420 4836"
> id="page_1" class="ocr_page">
>
> I really can't tell whether this should be considered an error or it
> is just fine. The file name is not to be rendered in the "web page",
> so there should be no need to escape it, right? Apart from that, it
> seems your code handles it correctly when loading the modified file
> again.

It should be just fine, as attr='"string"' is equivalent to
attr="&quot;string&quot;". If it wasn't escaped, the quote would
signify the end of the HTML attribute, which is not what we want. (It
is still mysterious to me why tesseract includes the quotes to begin
with, though.)
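
The equivalence can be checked with a short Python sketch using the standard library's `html` module and `html.parser`. The class `TitleGrabber` and the function `title_of` are illustrative names only, and Python's parser stands in here for the browser's.

```python
from html import escape
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Record the decoded value of every title attribute seen."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "title":
                self.titles.append(value)


def title_of(markup):
    grabber = TitleGrabber()
    grabber.feed(markup)
    grabber.close()
    return grabber.titles[0]


# Same attribute value, written with single-quote delimiters and with escapes.
single = """<div title='image "image.bmp"; bbox 0 0 3420 4836'>"""
escaped = '<div title="image &quot;image.bmp&quot;; bbox 0 0 3420 4836">'

# The parser decodes &quot; back to a literal quote, so both forms match.
assert title_of(single) == title_of(escaped)

# Writing the attribute back out is where the escapes come from.
print(escape('image "image.bmp"; bbox 0 0 3420 4836', quote=True))
```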

> I will take a look tomorrow to work out how your checkboxes behave.
> For now I presume Tesseract output will never have any paragraph
> tags.
>
> Finally, it would be nice if you could add a control at the top of the
> page (like the page dropdown menu) to set a zoom level for the
> displayed images, as I mainly work in DjVu format and do OCR on
> high-resolution b/w images, which get too large on screen.

This is a planned feature, but I haven't had much time to implement it.
Hopefully I will get a chance to fix it soon, as it really just
involves adding a UI element.

> It would be just as nice if you could add Unicode support, though I
> presume this will be slightly harder. The problem with ANSI is that it
> requires further conversion to Unicode (is this the standard encoding
> defined for hOCR?), and it cannot handle several languages at the same
> time (e.g. Spanish + Russian), as their ANSI codes for
> language-specific characters are not compatible.

I can't think of anything that would prevent Unicode from working. If a
document's encoding is UTF-8 or something similar (e.g. <head> contains
<meta charset=utf-8>), it should work, but I have not tested it. Maybe I
should make the editor save in UTF-8 format by default, or something...

Can you elaborate on any specific issues you have had?

> Thanks for all your time and for this little piece of wonderful software.

Thank you for the kind words!

Regards,
- Jim

Jim Garrison

May 27, 2011, 1:45:39 AM
to jsb...@mimuw.edu.pl, Havjers Havjers, Jakub Wilk, moz-hocr-edit
On 05/26/2011 10:39 PM, Janusz S. Bień wrote:
> On Thu, 26 May 2011 Jim Garrison <j...@garrison.cc> wrote:
>
>> On 05/26/2011 10:04 PM, Janusz S. Bień wrote:
>>> On Thu, 26 May 2011 Havjers Havjers <hav...@gmail.com> wrote:
>>>
>>> [...]
>>>
>>>> I mainly work in DjVu format
>>>
>>> What about adapting moz-hocr-edit to work directly with DjVu?
>>
>> inline djvu is not natively supported by mozilla/Firefox (or any other
>> web browser), so this would be very nontrivial to implement.
>
> I don't mean inline djvu, but embedding.
>
> We embed DjVu e.g. on the welcome page of our digital library
>
> http://bc.klf.uw.edu.pl/
>
> and nobody has ever complained. Please check for yourself.

moz-hocr-edit requires that the source image have the flexibility of the
html <img> tag; an embedded "plugin" is not good enough. Specifically,
we need to be able to use the image as the source for a <canvas>
element, which allows the program to display rescaled and cropped images
for each line of text.
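
To illustrate the arithmetic behind "rescaled and cropped", here is a small Python sketch that turns a line's bounding box into a source rectangle and a destination size, the two inputs that a canvas `drawImage` call with a source rectangle needs. The function name `line_crop` and the fixed display width are assumptions made for this example, not taken from moz-hocr-edit.

```python
def line_crop(bbox, display_width):
    """Return ((sx, sy, sw, sh), (dw, dh)) for one text line.

    bbox is (x0, y0, x1, y1) in image pixels, as in an hOCR 'bbox' property.
    The line is scaled uniformly so that it is display_width pixels wide.
    """
    x0, y0, x1, y1 = bbox
    src_w, src_h = x1 - x0, y1 - y0
    if src_w <= 0 or src_h <= 0:
        raise ValueError("empty bounding box")
    scale = display_width / src_w
    source = (x0, y0, src_w, src_h)
    dest = (display_width, max(1, round(src_h * scale)))
    return source, dest


# A 1000 x 100 pixel line shown 500 pixels wide is drawn at half size.
print(line_crop((100, 200, 1100, 300), 500))
```

A plugin that renders the page itself gives the script no pixel source to crop from, which is why an `<img>`-like source is required.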

Janusz S. Bień

May 27, 2011, 7:30:16 AM
to Gaspar Llamazares, Jim Garrison, Havjers Havjers, Jakub Wilk, moz-hocr-edit
On Fri, 27 May 2011 Gaspar Llamazares <gaspar.l...@gmail.com> wrote:

> Hi everyone.
>
> When I said I mainly work in DjVu format, I meant that this is the format I
> save my scanned books in, so the images and OCR coordinates are based on
> high-resolution black-and-white pictures, and I wouldn't want to convert them
> to lower-resolution greyscale and then perform OCR again. Furthermore, I don't
> think anyone will use the hOCR editor to proofread DjVu OCR until word-level
> coordinates are preserved.

They are preserved at least if you work on FineReader output
converted to DjVu with Jakub Wilk's pdf2djvu
(http://jwilk.net/software/pdf2djvu).

You can check this with our search engine mentioned earlier:

http://poliqarp.wbl.klf.uw.edu.pl/

Without the word-level coordinates we would be unable to highlight hits as
we do now.

Gaspar Llamazares

May 27, 2011, 7:08:01 AM
to Jim Garrison, jsb...@mimuw.edu.pl, Havjers Havjers, Jakub Wilk, moz-hocr-edit
Hi everyone.

When I said I mainly work in DjVu format, I meant that this is the format I
save my scanned books in, so the images and OCR coordinates are based on
high-resolution black-and-white pictures, and I wouldn't want to convert them
to lower-resolution greyscale and then perform OCR again. Furthermore, I don't
think anyone will use the hOCR editor to proofread DjVu OCR until word-level
coordinates are preserved. Finally, loading this format would require some
plugin, as it is a binary format in which the hidden text is stored compressed.

On the other hand, the hOCR editor is great for producing formatted/reflowable
text from OCRed scanned files, be they in DjVu, PDF, or image format. This is
the main use I am making of it.

Regarding Unicode, the problem is that ANSI encoding is being used instead.
While ANSI works for one language at a time, provided the right codepage is
set, it won't support multiple languages at once. This happens because
non-English characters use the extended codes (128-255), and the mapping from
those codes to characters depends on the local system/browser codepage.

As an example, the following Spanish characters in codepage Windows-1252:
ñÑáÁéÉíÍóÓúÚüÜ
will be rendered, under the Russian codepage Windows-1251, as:
сСбБйЙнНуУъЪьЬ
(and vice versa, so books in most non-English languages are not supported yet).
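
The mapping above can be reproduced with Python's built-in `cp1252` and `cp1251` codecs. This is only a sketch of the effect; the helper name `reinterpret` is made up for the example.

```python
def reinterpret(text, written_as, read_as):
    """Encode text with one codepage, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as)


spanish = "ñÑáÁéÉíÍóÓúÚüÜ"

# The bytes stay the same; only the codepage used to read them changes.
print(reinterpret(spanish, "cp1252", "cp1251"))

# UTF-8 can hold Spanish and Russian letters in the same file.
mixed = "ñ и ж"
assert reinterpret(mixed, "utf-8", "utf-8") == mixed

# A single 8-bit codepage cannot represent both.
try:
    mixed.encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot represent the Cyrillic letters")
```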

This issue extends to other operating systems as well. One essential problem
with Unicode, I presume, will be counting the number of actual characters. As
for the hOCR file, simply using UTF-8 will suffice.
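
As a small illustration of why counting characters needs care, the Python sketch below compares three possible "lengths" of the same text: UTF-8 bytes, code points in composed form (NFC), and code points in decomposed form (NFD). The function name `counts` is hypothetical, and which count a given tool should use is left open here.

```python
import unicodedata


def counts(text):
    """Return byte and code point counts for text in NFC and NFD forms."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "utf8_bytes": len(nfc.encode("utf-8")),
        "code_points_nfc": len(nfc),
        "code_points_nfd": len(nfd),
    }


# "año" written with an escape so the source does not depend on the editor's encoding.
print(counts("a\u00f1o"))
```

Normalizing to NFC before counting gives the figure most readers would expect for text like this, although some scripts need grapheme-level counting that this sketch does not attempt.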

Jim Garrison wrote:


> On 05/26/2011 10:39 PM, Janusz S. Bień wrote:
>> On Thu, 26 May 2011 Jim Garrison <j...@garrison.cc> wrote:
>>
