using tesseract for ocr in djvu but needs the boxed pixel position of every word

Jelle de Jong

Jul 12, 2009, 1:30:00 PM

to tesser...@googlegroups.com

Hello everybody,

I have been using FLOSS DjVu[1] tools for a year now and have built
several tools around them[2], some of which are in Debian[3].

So I wanted to add DjVu OCR support to my systems. There is an
any2djvu[4] server that converts DjVu to DjVu with OCR. This works quite
well but I need to do this with FLOSS[5] tools on my own systems.

So I started testing OCR tools:

apt-cache search ocr
sudo apt-get install djview4
sudo apt-get install gocr ocrad ocropus
sudo apt-get install cuneiform cuneiform-common
sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-nld

djvused ~/document-0003.djvu -e 'n'
ddjvu -format=tiff -mode=black -page=1 ~/document-0003.djvu ~/image-0003.tif
ddjvu -format=pbm -mode=black -page=1 ~/document-0003.djvu ~/image-0003.pbm

gocr -i ~/image-0003.pbm > ~/gocr-0003.txt
ocrad ~/image-0003.pbm > ~/ocrad-0003.txt
ocroscript rec-tess ~/image-0003.pbm > ~/ocroscript-0003.html
tesseract ~/image-0003.tif ~/tesseract-0003 -l nld
cuneiform -l dut -f text -o ~/cuneiform-0003.txt ~/image-0003.tif

Tesseract produced the best results, so I want to take its output and
merge it back into the DjVu file.

The any2djvu output showed me the internal DjVu text structure:
djvused ~/document-0003-any2djvu.djvu -e 'select 1; print-pure-txt' > document-0003-any2djvu.txt
djvused ~/document-0003-any2djvu.djvu -e 'select 1; print-txt' > document-0003-any2djvu-structure.txt

(page 0 0 4960 7014
(line 478 6163 3067 6354
(word 478 6163 816 6354 "Een")
(word 888 6163 1522 6354 "school")
(word 1604 6163 1750 6354 "is")
(word 1824 6163 2270 6354 "geen")
(word 2350 6163 3067 6354 "kantoor"))
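
A minimal sketch of how such a structure could be written out once the
boxes are known. The names `quote` and `format_zone` are made up for this
illustration, and the escaping assumes C-style backslash escapes for
quotes and backslashes (see "Hidden text syntax" in man djvused for the
full rules):

```python
def quote(text):
    """Wrap a word in double quotes, escaping backslashes and quotes.

    Assumption: the word contains no control characters.
    """
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def format_zone(kind, bbox, children):
    """Render one zone as an S-expression.

    kind     -- zone type, e.g. 'page', 'line' or 'word'
    bbox     -- (xmin, ymin, xmax, ymax) in DjVu page coordinates
    children -- a string (the text of a word) or a list of
                (kind, bbox, children) tuples for nested zones
    """
    head = "({} {} {} {} {}".format(kind, *bbox)
    if isinstance(children, str):
        return head + " " + quote(children) + ")"
    # Put each nested zone on its own line, indented one space deeper.
    body = "".join(
        "\n " + format_zone(*child).replace("\n", "\n ")
        for child in children
    )
    return head + body + ")"


words = [
    ("word", (478, 6163, 816, 6354), "Een"),
    ("word", (888, 6163, 1522, 6354), "school"),
]
page = ("page", (0, 0, 4960, 7014),
        [("line", (478, 6163, 1522, 6354), words)])
print(format_zone(*page))
```

The nesting mirrors the page, line and word levels shown above; other
zone types would go through the same function unchanged.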

I need to know the position of every line and word inside the TIFF
image. The positions are used to draw a selection box around each word,
so the text can be searched and selected just as in a PDF document.
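
One detail worth keeping in mind when mapping image positions to DjVu
hidden text: image and hOCR boxes are normally measured from the top-left
corner, while DjVu hidden text measures from the bottom-left corner of
the page. A small sketch of the conversion, assuming both use the same
pixel resolution (`image_to_djvu_bbox` is a made-up name):

```python
def image_to_djvu_bbox(bbox, page_height):
    """Convert a top-left-origin box to a bottom-left-origin box.

    bbox        -- (left, top, right, bottom) in image pixels
    page_height -- height of the page in the same pixels
    Returns (xmin, ymin, xmax, ymax) as used in DjVu hidden text.
    """
    left, top, right, bottom = bbox
    # x values are unchanged; y values are mirrored, so the image
    # 'bottom' becomes ymin and the image 'top' becomes ymax.
    return (left, page_height - bottom, right, page_height - top)


# A box near the top of a 7014 pixel high page ends up with large y values.
print(image_to_djvu_bbox((470, 921, 3318, 1038), 7014))
```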

Could you help by adding an output option that emits this information,
or that writes the DjVu text structure[6] directly?

djvused ~/document-0003.djvu -e 'select 1; remove-txt' -s
djvused ~/document-0003.djvu -e 'select 1; set-txt tesseract-0003-djvu.txt' -s
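
For scripting this step per page, a sketch that only builds the argument
list for the set-txt call above. `djvused_set_txt_command` is a made-up
helper name, and the sketch assumes file names without spaces or quotes,
since it does no quoting inside the djvused script:

```python
def djvused_set_txt_command(djvu_path, page, txt_path):
    """Build the djvused call that loads hidden text into one page.

    Assumption: txt_path needs no quoting inside the djvused script.
    """
    script = "select {}; set-txt {}".format(page, txt_path)
    # -s makes djvused save the modified document in place.
    return ["djvused", djvu_path, "-e", script, "-s"]


cmd = djvused_set_txt_command("document-0003.djvu", 1,
                              "tesseract-0003-djvu.txt")
print(cmd)
```

The list can be handed to `subprocess.run(cmd, check=True)`. No shell is
involved, so `~` is not expanded and full paths are needed.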

I also saw ocroscript already creates an almost usable html output with
page and line pixel info but no word box information:

ocr_page; bbox 0 0 4960 7014>
ocr_line; bbox 470 921 3318 1038>something with more text

So to summarize, I would really like to see a new option like this:
tesseract ~/image-0003.tif ~/tesseract-0003-djvu --lang nld --format djvu-text
(of course, something else with the same result would be great too)

I attached my source DjVu document so you can reproduce everything I did.

I hope you guys can find some resources to pull this off? If limited
sponsoring is desirable please contact me and I will see what I can arrange.

What are your thoughts on this? Is it doable, and in what time frame?

Many thanks in advance,

Best regards,

Jelle de Jong

[1] http://en.wikipedia.org/wiki/DjVu
[2] https://secure.powercraft.nl/svn/packages/trunk/source/pct-scanner-scripts/
[3] http://packages.debian.org/sid/pct-scanner-scripts
[4] http://any2djvu.djvuzone.org/
[5] http://en.wikipedia.org/wiki/FLOSS
[6] man djvused | "Hidden text syntax"

document-0003.djvu

Jeffrey Ratcliffe

Jul 12, 2009, 1:48:32 PM

to tesser...@googlegroups.com

2009/7/12 Jelle de Jong <jon...@gmail.com>:

> I hope you guys can find some resources to pull this off? If limited
> sponsoring is desirable please contact me and I will see what I can arrange.

HEAD of gscan2pdf
(http://gscan2pdf.git.sourceforge.net/git/gitweb.cgi?p=gscan2pdf) does
this already.

Regards

Jeff

tuxcrafter

Jul 12, 2009, 2:18:16 PM

to tesseract-ocr

I figured out some more information. It seems ocropus is better suited
to producing structured output. I also looked at the hOCR standard,
which should be able to carry word, line and page box locations.
So we need an hOCR to djvu-hidden-text converter.
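
A rough sketch of the reading half of such a converter, using only the
Python standard library. It assumes the hOCR input marks words with the
`ocrx_word` class and a `bbox x0 y0 x1 y1` entry in the `title`
attribute; the class name `HocrWords` is made up for this illustration:

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (bbox, text) for every element with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (bbox, text) pairs
        self._bbox = None  # bbox of the word currently open, if any
        self._text = []    # text fragments of the open word
        self._depth = 0    # tags opened inside the open word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> inside a word
            return
        classes = (attrs.get("class") or "").split()
        match = BBOX.search(attrs.get("title") or "")
        if "ocrx_word" in classes and match:
            self._bbox = tuple(int(n) for n in match.groups())
            self._text = []
            self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth:
            self._depth -= 1
            return
        self.words.append((self._bbox, "".join(self._text).strip()))
        self._bbox = None


parser = HocrWords()
parser.feed(
    "<span class='ocr_line' title='bbox 470 921 3318 1038'>"
    "<span class='ocrx_word' title='bbox 470 921 800 1038'>Een</span> "
    "<span class='ocrx_word' title='bbox 880 921 1500 1038'>school</span>"
    "</span>"
)
parser.close()
print(parser.words)
```

The boxes collected this way are still in image coordinates, so they
would need converting before being written as DjVu hidden text.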

I also don't know how to make ocroscript (ocropus) use tesseract as its
OCR engine. I am using Debian sid/experimental, but the ocroscript
support feels very limited and I can find almost no information on how
to use it.

So could somebody show me how to use ocropus with tesseract to
create word-based hOCR output that can be converted to the
djvu-hidden-text format?

Thanks in advance,

Jeffrey Ratcliffe

Jul 12, 2009, 2:24:02 PM

to tesser...@googlegroups.com

2009/7/12 tuxcrafter <jon...@gmail.com>:

> So if somebody can show me how to use ocropus with tesseract and
> create word based hOCR output that can be converted to
> djvu-hidden-text format?

The code I have used in gscan2pdf is

ocroscript $SETTING{ocroscript} --tesslanguage=$SETTING{'ocr language'} $png > $txt.txt

where

$SETTING{ocroscript} is either 'recognize' or 'rec-tess'

and

$SETTING{'ocr language'} is the language code

gscan2pdf writes the DjVu hidden text automatically.

Regards

Jeff

tuxcrafter

Jul 12, 2009, 2:24:12 PM

to tesseract-ocr

Hi Jeff,

Thanks for your reply.

Could you be more specific on what the head of gscan2pdf already does?

I am not interested in any GUI application, only in CLI tools that do
one job well and are documented, and in general-purpose, documented
libraries that can be used by multiple applications. That way the
resulting solutions will be maintainable and sustainable on all
fronts.

PS also see: http://docs.google.com/View?docid=dfxcv4vc_67g844kf

Jeffrey Ratcliffe

Jul 12, 2009, 2:32:43 PM

to tesser...@googlegroups.com

2009/7/12 tuxcrafter <jon...@gmail.com>:

> Could you be more specific on what the head of gscan2pdf already does?
>
> I am not interested in any GUI application, so only CLI tools that do
> one job good with documentation and general designed libraries with
> documentation that can be used by multiple applications. This way the
> resulting solutions will be maintainable and sustainable on all
> fronts.

gscan2pdf is a GUI that automates scanning, image clean-up, OCR and
saving, e.g. as DjVu or PDF, with the OCR embedded in the file.

tesseract output is simply embedded free-form. ocropus output is
placed at the positions reported by hOCR.

The conversion from hOCR to DjVu hidden text format seemed so trivial
to me that I didn't think an extra tool would be generally
useful. If you can read Perl, look at the gscan2pdf source.

Regards

Jeff

tuxcrafter

Jul 12, 2009, 4:00:36 PM

to tesseract-ocr

On Jul 12, 8:32 pm, Jeffrey Ratcliffe <jeffrey.ratcli...@gmail.com>
wrote:
> 2009/7/12 tuxcrafter <jong...@gmail.com>:

The version I had from Debian experimental did not produce the word
elements I need in my djvu-hidden-text structure, so I tried to get the
latest HEAD of gscan2pdf running. You have a nice system for building
debs from make :D

However, this version doesn't work very well on my system; both
ocroscript and tesseract failed:

$ gscan2pdf
Useless use of sort in void context at /usr/bin/gscan2pdf line 7756.
gscan2pdf 0.9.29
Gdk-CRITICAL **: gdk_x11_atom_to_xatom_for_display: assertion `atom != GDK_NONE' failed at /usr/bin/gscan2pdf line 1753.
Use of uninitialized value in concatenation (.) or string at /usr/bin/gscan2pdf line 9701.
usage: ocroscript [options] [script [args]].
Available options are:
-e stat execute string 'stat'
-l name require library 'name'
-i enter interactive mode after executing 'script'
-v show version information
-- stop handling options
- execute stdin and stop handling options
*** unhandled exception in callback:
*** type Goo::Canvas::Text does not support property 'height' at /usr/bin/gscan2pdf line 9750.
*** ignoring at /usr/bin/gscan2pdf line 10786.
Use of uninitialized value in concatenation (.) or string at /usr/bin/gscan2pdf line 9701.
usage: ocroscript [options] [script [args]].
Available options are:
-e stat execute string 'stat'
-l name require library 'name'
-i enter interactive mode after executing 'script'
-v show version information
-- stop handling options
- execute stdin and stop handling options
*** unhandled exception in callback:
*** type Goo::Canvas::Text does not support property 'height' at /usr/bin/gscan2pdf line 9750.
*** ignoring at /usr/bin/gscan2pdf line 10786.
Tesseract Open Source OCR Engine
*** unhandled exception in callback:
*** type Goo::Canvas::Text does not support property 'height' at /usr/bin/gscan2pdf line 9750.
*** ignoring at /usr/bin/gscan2pdf line 10786.
Tesseract Open Source OCR Engine
*** unhandled exception in callback:
*** type Goo::Canvas::Text does not support property 'height' at /usr/bin/gscan2pdf line 9750.
*** ignoring at /usr/bin/gscan2pdf line 10786.
Tesseract Open Source OCR Engine
*** unhandled exception in callback:
*** type Goo::Canvas::Text does not support property 'height' at /usr/bin/gscan2pdf line 9750.
*** ignoring at /usr/bin/gscan2pdf line 10786.
Gdk-CRITICAL **: gdk_x11_atom_to_xatom_for_display: assertion `atom != GDK_NONE' failed at /usr/bin/gscan2pdf line 3091.

jeffrey....@gmail.com

Jul 13, 2009, 12:29:02 AM

to tesser...@googlegroups.com

On Jul 12, 2009 10:00pm, tuxcrafter <jon...@gmail.com> wrote:
> However this version doesn't work very well on my system both
> ocroscript as tesseract failed:

Which version of ocropus are you using?

At the moment, I have only tested gscan2pdf with v0.2, as I haven't managed to get iulib into Debian yet, and it is needed for 0.3 and above.

Regards

Jeff

tuxcrafter

Jul 13, 2009, 4:09:20 AM

to tesseract-ocr

On Jul 13, 6:29 am, jeffrey.ratcli...@gmail.com wrote:

> On Jul 12, 2009 10:00pm, tuxcrafter <jong...@gmail.com> wrote:
>
> > However this version doesn't work very well on my system both
> > ocroscript as tesseract failed:
>
> Which version of ocropus are you using?
>
> At the moment, I have only tested gscan2pdf with v0.2, as I haven't managed
> to get iulib into Debian yet, and it is needed for 0.3 and above.
>
> Regards
>
> Jeff

jeffrey....@gmail.com

Jul 13, 2009, 5:05:56 AM

to tesser...@googlegroups.com

On Jul 13, 2009 10:09am, tuxcrafter <jon...@gmail.com> wrote:
> Are you sure you got a ocr result with all word boxes in the
> djvu-hidden-text structure?

I did it on a per-line basis, as that was the default hOCR output and good enough to get the initial functionality into gscan2pdf. If the word boxes are available, it would be trivial to change.

However, I'm not even sure word boxes are a good idea, as it might defeat multi-word searches.

Regards

Jeff

tuxcrafter

Jul 13, 2009, 6:11:18 AM

to tesseract-ocr

On Jul 13, 11:05 am, jeffrey.ratcli...@gmail.com wrote:

Well, the word boxes are very powerful and any2djvu does a pretty good
job with them. If you open a DjVu file with word boxes in the hidden
text, you can select and search through the document just as in a
normal PDF, and that is what I need.

So your new version does not include this feature, and my original
question remains: does somebody have the resources to create tesseract
hOCR output with page, line and word classes, so that it can be
converted to the djvu-hidden-text structure?

Best regards,

Jelle

tuxcrafter

Jul 15, 2009, 2:53:53 PM

to tesseract-ocr

Is there a mailing list or bug tracker for these issues? I am
afraid the request will get lost in a web-only group discussion
system.

Jeffrey Ratcliffe

Jul 15, 2009, 3:14:44 PM

to tesser...@googlegroups.com

2009/7/15 tuxcrafter <jon...@gmail.com>:

> is there some mailing list or bug tracker for these issues, i am
> afraid the request will get lost of an only web based groups discuss
> system?

A quick google shows:

http://code.google.com/p/tesseract-ocr/issues/list

And, BTW, this IS a mailing list.

Jelle de Jong

Jul 15, 2009, 3:37:52 PM

to tesser...@googlegroups.com

Thanks for the link, my Google results were kind of different ;-)

I made a feature request:
http://code.google.com/p/tesseract-ocr/issues/detail?id=221

I am more used to Mailman-like systems, but I am trying :)

Best regards,

Jelle

ksa...@gmail.com

Aug 6, 2013, 6:12:09 PM

to tesser...@googlegroups.com

I've made a patch to v3.02.02 adding djvused output ("tessedit_create_djvused" configuration option). It still needs testing with CJK and right-to-left scripts and multipage scans, but it works for single pages in Russian. The patch is attached to my reply to http://code.google.com/p/tesseract-ocr/issues/detail?id=221
