I haven't worked significantly on the editor since Tesseract 3.00 was
released, which was when they added support for hOCR. I managed to
build it and create a test document, which is now located in the repository.
I'm not sure why they chose to put the image filename between quotes.
In fact, if you give Tesseract an input image whose file name contains
quotes or spaces, it doesn't escape them in any way, making the
quotes quite useless. They're also not mentioned in the hOCR spec, as
far as I can tell.
Nevertheless, I made it so that such quotes are removed if present.
I will also look into removing the word-level hOCR data when I get some
free time. I think that really is the best way to solve this problem.
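For illustration, the quote-stripping can be sketched like this (a hypothetical helper, not the editor's actual code), operating on the title attribute value of the ocr_page element:

```python
import re

def strip_filename_quotes(title):
    # title looks like: image "image.bmp"; bbox 0 0 3420 4836
    # Remove the quotes Tesseract puts around the image filename.
    return re.sub(r'image\s+"([^"]*)"', r'image \1', title)

print(strip_filename_quotes('image "image.bmp"; bbox 0 0 3420 4836'))
# → image image.bmp; bbox 0 0 3420 4836
```

Note this deliberately does nothing when the quotes are absent, so files from other OCR engines pass through unchanged.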
Cheers,
Jim
Thanks for the prerelease version, and sorry about the delay. I have
had some time to test it under the latest Nightly build for Win32 with
Tesseract 3.01 output, and it seems to work OK, except that when
saving to hOCR the image file name is written with HTML escape
sequences for inverted commas:
<div title="image &quot;image.bmp&quot;; bbox 0 0 3420 4836"
id="page_1" class="ocr_page">
I really can't tell whether this should be considered an error or
whether it is just fine. The file name is not going to be rendered in
the "web page", so there should be no need to escape it, right? Apart
from that, it
seems your code handles it correctly when loading the modified file
again.
I will take a look tomorrow to figure out the behaviour of your
checkboxes. For now I presume Tesseract output will never have any
paragraph tag.
Finally, it would be nice if you could add some control at the top of
the page (like the page dropdown menu) to set a zoom level for the
displayed images, as I mainly work in DjVu format and do OCR on
high-resolution b/w images, which get too large on screen. It would be
just as nice if you could add Unicode support, though I presume this
will be slightly harder.
The problem with ANSI is that it requires further conversion to
Unicode (is this the standard encoding defined for hOCR?), and thus it
won't be able to handle several languages at the same time (e.g.
Spanish+Russian), as their ANSI codes for the actual language
characters are not compatible.
Thanks for all your time and for this little piece of wonderful software.
inline djvu is not natively supported by mozilla/Firefox (or any other
web browser), so this would be very nontrivial to implement.
I think the remainder of this message (below) belongs on the hOCR
discussion list, which I have included.
> To be more precise, what about extending hOCR format to allow links to
> DjVu page fragments instead of including the images? Such links look like this:
>
> http://poliqarp.wbl.klf.uw.edu.pl/extra/linde/index.djvu?djvuopts=&zoom=154&showposition=0.5,0.26&highlight=1190,1840,1016,50&page=p0155.djvu
> http://poliqarp.wbl.klf.uw.edu.pl/extra/linde/index.djvu?djvuopts=&zoom=154&showposition=0.5,0.26&highlight=1183,1791,1025,61&page=p0155.djvu
>
> Of course the common part should be stored only once.
>
> Then the editor may just embed the DjVu fragment in the displayed page
> (the highlight color may be configurable).
>
> If the change to hOCR format is agreed, then I hope Jakub Wilk would
> be willing to extend appropriately his djvu2hocr program bundled with
> ocrodjvu:
>
> http://jwilk.net/software/ocrodjvu
>
> Best regards
>
> Janusz
>
> P.S. Perhaps those links
>
> http://bc.klf.uw.edu.pl/177/
> http://poliqarp.wbl.klf.uw.edu.pl/
>
> may be of some interest to you.
>
> On 05/26/2011 10:04 PM, Janusz S. Bień wrote:
>> On Thu, 26 May 2011 Havjers Havjers <hav...@gmail.com> wrote:
>>
>> [...]
>>
>>> I mainly work in DjVu format
>>
>> What about adapting moz-hocr-edit to work directly with DjVu?
>
> inline djvu is not natively supported by mozilla/Firefox (or any other
> web browser), so this would be very nontrivial to implement.
I don't mean inline djvu, but embedding.
We embed DjVu, e.g. on the welcome page of our digital library,
and nobody has ever complained. Please check for yourself.
Best regards
Janusz
--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
It should be just fine, as attr='"string"' is equivalent to
attr="&quot;string&quot;". If it weren't escaped, the quote would
signify the end of the HTML attribute, which is not what we want. (It
is still mysterious to me why Tesseract includes the quotes to begin
with, though.)
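This equivalence is easy to verify with any HTML parser; here is a quick check using Python's stdlib (the fragment and helper class are purely illustrative):

```python
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    # Record the title attribute of the first start tag seen.
    def handle_starttag(self, tag, attrs):
        self.value = dict(attrs)['title']

def title_of(fragment):
    p = AttrGrabber()
    p.feed(fragment)
    return p.value

# Literal quotes inside single-quoted attribute vs. &quot; entities
# inside a double-quoted attribute: both parse to the same value.
a = title_of("""<div title='"image.bmp"; bbox 0 0 10 10'>""")
b = title_of('<div title="&quot;image.bmp&quot;; bbox 0 0 10 10">')
assert a == b == '"image.bmp"; bbox 0 0 10 10'
```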
> I will take a look tomorrow to figure out the behaviour of your
> checkboxes. For now I presume Tesseract output will never have any
> paragraph tag.
>
> Finally, it would be nice if you could add some control at the top of
> the page (like the page dropdown menu) to set a zoom level for the
> displayed images, as I mainly work in DjVu format and do OCR on
> high-resolution b/w images, which get too large on screen.
This is a planned feature, but I haven't had much time to implement it.
Hopefully I will get a chance to fix it soon, as it really just
involves adding a UI element.
> It would be just as nice if you could add Unicode support, though I
> presume this will be slightly harder. The problem with ANSI is that it
> requires further conversion to Unicode (is this the standard encoding
> defined for hOCR?), and thus it won't be able to handle several
> languages at the same time (e.g. Spanish+Russian), as their ANSI codes
> for the actual language characters are not compatible.
I can't think of anything that would prevent Unicode from working. If a
document's encoding is UTF-8 or something similar (e.g. <head> contains
<meta charset=utf-8>), it should work, but I have not tested it. Maybe
I should make the editor save in UTF-8 by default, or something...
Can you elaborate on any specific issues you have had?
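A sketch of what honouring a declared charset on load could look like (assumed behaviour and a hypothetical helper, not the editor's current code): peek at the start of the file for a <meta charset=...> declaration and fall back to UTF-8.

```python
import re

def decode_hocr(raw_bytes):
    # Look for <meta charset=...> in the first 1 KiB of the file.
    head = raw_bytes[:1024].decode('ascii', errors='replace')
    m = re.search(r'<meta\s+charset=["\']?([\w-]+)', head, re.I)
    encoding = m.group(1) if m else 'utf-8'
    return raw_bytes.decode(encoding)

html = '<html><head><meta charset=utf-8></head><body>ñ с</body></html>'
assert decode_hocr(html.encode('utf-8')) == html
```

Saving would be the mirror image: always write UTF-8 and make sure the <head> declares it.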
> Thanks for all your time and for this little piece of wonderful software.
Thank you for the kind words!
Regards,
- Jim
moz-hocr-edit requires that the source image have the flexibility of the
html <img> tag; an embedded "plugin" is not good enough. Specifically,
we need to be able to use the image as the source for a <canvas>
element, which allows the program to display rescaled and cropped images
for each line of text.
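The crop-and-scale arithmetic behind that per-line display can be sketched as follows (a hypothetical helper in Python, not moz-hocr-edit's actual code): given a line's bbox from the hOCR and a target strip height, compute the source rectangle and scaled destination size that a canvas drawImage call would take.

```python
def line_draw_params(bbox, target_height):
    # bbox is (x0, y0, x1, y1) from the hOCR "bbox" property.
    x0, y0, x1, y1 = bbox
    src_w, src_h = x1 - x0, y1 - y0
    scale = target_height / src_h
    # Source rect in the page image, destination size on screen.
    return (x0, y0, src_w, src_h), (round(src_w * scale), target_height)

src, dst = line_draw_params((100, 200, 1100, 250), 25)
assert src == (100, 200, 1000, 50)  # 1000x50 px region of the page
assert dst == (500, 25)             # drawn at half size, 25 px tall
```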
> Hi everyone.
>
> When I said I mainly work in DjVu format, I meant this is the format I
> save my scanned books into, so the images and OCR coordinates are based
> on high-resolution black and white pictures, and I wouldn't want to
> convert them to lower-resolution grey and then perform OCR again.
> Furthermore, I don't think anyone will use the hOCR editor to proofread
> DjVu OCR until word-level coordinates are preserved.
They are preserved at least if you work on FineReader output
converted to DjVu with Jakub Wilk's pdf2djvu
(http://jwilk.net/software/pdf2djvu).
You can check this with our search engine mentioned earlier:
http://poliqarp.wbl.klf.uw.edu.pl/
Without the word-level coordinates we would be unable to highlight hits as
we do now.
When I said I mainly work in DjVu format, I meant this is the format I
save my scanned books into, so the images and OCR coordinates are based
on high-resolution black and white pictures, and I wouldn't want to
convert them to lower-resolution grey and then perform OCR again.
Furthermore, I don't think anyone will use the hOCR editor to proofread
DjVu OCR until word-level coordinates are preserved. Finally, loading
this format would require some plugin, as it is a binary format in
which the hidden text is stored compressed.
On the other hand, the hOCR editor is great for producing
formatted/reflowable text from OCRed scanned files, whether in DjVu,
PDF, or image format. This is the main use I am making of it.
Regarding Unicode, the problem is that ANSI encoding is being used
instead. While ANSI works for one language at a time, provided the
right codepage is set, it won't support multiple languages at once.
This happens because, for non-English characters, extended codes
(129-255) are used, and the mapping from such codes to characters
depends on the local system/browser codepage.
As an example, the following Spanish characters in codepage Windows-1252:
ñÑáÁéÉíÍóÓúÚüÜ
will be rendered, under the Russian codepage Windows-1251, as:
сСбБйЙнНуУъЪьЬ
(and vice versa), so most books in non-English-based languages are not
supported yet.
This issue extends to other OSes as well. One essential problem with
Unicode, I presume, will be counting the number of actual characters.
As for the hOCR file, simply using UTF-8 will suffice.
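The incompatibility described above can be reproduced directly; here is a quick Python check, using Python's codec names cp1252/cp1251 for the two Windows codepages:

```python
# Bytes written under the Spanish codepage (Windows-1252) read back
# under the Russian codepage (Windows-1251) come out as mojibake.
spanish = 'ñÑáÁéÉíÍóÓúÚüÜ'
garbled = spanish.encode('cp1252').decode('cp1251')
print(garbled)  # → сСбБйЙнНуУъЪьЬ
assert garbled == 'сСбБйЙнНуУъЪьЬ'
```

The same byte values (0x81-0xFF) simply map to different characters in each codepage, which is exactly why a single ANSI file cannot mix the two alphabets.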
Jim Garrison wrote:
> On 05/26/2011 10:39 PM, Janusz S. Bień wrote:
>> On Thu, 26 May 2011 Jim Garrison <j...@garrison.cc> wrote:
>>