Hi all, I have been using FLOSS DjVu[1] tools for a year now and have built
several tools around them[2], some of which are in Debian[3].
So I wanted to add DjVu OCR support to my systems. There is an
any2djvu[4] server that converts DjVu to DjVu with OCR. This works quite
well but I need to do this with FLOSS[5] tools on my own systems.
So I started testing OCR tools:
apt-cache search ocr
sudo apt-get install djview4
sudo apt-get install gocr ocrad ocropus
sudo apt-get install cuneiform cuneiform-common
sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-nld
djvused ~/document-0003.djvu -e 'n'
ddjvu -format=tiff -mode=black -page=1 ~/document-0003.djvu ~/image-0003.tif
ddjvu -format=pbm -mode=black -page=1 ~/document-0003.djvu ~/image-0003.pbm
gocr -i ~/image-0003.pbm > ~/gocr-0003.txt
ocrad ~/image-0003.pbm > ~/ocrad-0003.txt
ocroscript rec-tess ~/image-0003.pbm > ~/ocroscript-0003.html
tesseract ~/image-0003.tif ~/tesseract-0003 -l nld
cuneiform -l dut -f text -o ~/cuneiform-0003.txt ~/image-0003.tif
tesseract produced the best results, so I want to take its output and merge
it back into the DjVu file.
The any2djvu output showed me the internal DjVu text structure:
djvused ~/document-0003-any2djvu.djvu -e 'select 1; print-pure-txt' > document-0003-any2djvu.txt
djvused ~/document-0003-any2djvu.djvu -e 'select 1; print-txt' > document-0003-any2djvu-structure.txt
(page 0 0 4960 7014
 (line 478 6163 3067 6354
  (word 478 6163 816 6354 "Een")
  (word 888 6163 1522 6354 "school")
  (word 1604 6163 1750 6354 "is")
  (word 1824 6163 2270 6354 "geen")
  (word 2350 6163 3067 6354 "kantoor"))
I need to know the position of every line and word inside the TIFF
image. These positions define the selection box around each word, so the
text can be searched and selected just as in a PDF document.
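To illustrate the mapping involved, here is a minimal Python sketch (not an existing tool). It assumes the OCR engine reports word boxes in image pixel coordinates with the origin at the top-left, while the DjVu hidden text layer measures y from the bottom of the page. The function names and the input layout are hypothetical, and the string escaping only covers backslashes and double quotes; see "Hidden text syntax" in man djvused for the full rules.

```python
def flip_box(box, page_height):
    """Convert (x0, y0, x1, y1) with a top-left origin to DjVu's
    (xmin, ymin, xmax, ymax) with a bottom-left origin."""
    x0, y0, x1, y1 = box
    return (x0, page_height - y1, x1, page_height - y0)


def quote(text):
    """Escape backslashes and double quotes (assumed to be sufficient here)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def djvu_page(width, height, lines):
    """Build a hidden text structure from OCR word boxes.

    `lines` is a hypothetical input layout: a list of lines, each a
    non-empty list of ((x0, y0, x1, y1), word) tuples in image pixels.
    """
    out = ["(page 0 0 %d %d" % (width, height)]
    for words in lines:
        boxes = [flip_box(box, height) for box, _ in words]
        # The line box is the union of its word boxes.
        line_box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes))
        out.append(" (line %d %d %d %d" % line_box)
        for box, (_, word) in zip(boxes, words):
            out.append("  (word %d %d %d %d %s)" % (box + (quote(word),)))
        out[-1] += ")"  # close the line
    out[-1] += ")"      # close the page
    return "\n".join(out)


if __name__ == "__main__":
    sample = [[((478, 660, 816, 851), "Een"),
               ((888, 660, 1522, 851), "school")]]
    print(djvu_page(4960, 7014, sample))
```

With a page height of 7014, a word box running from y=660 to y=851 in the image becomes 6163 to 6354 in the DjVu structure, which matches the y range in the any2djvu output above.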
So could you help out and add an output option that either reports this
information or writes the DjVu text structure[6] directly? I could then
load the result into the document with djvused:
djvused ~/document-0003.djvu -e 'select 1; remove-txt' -s
djvused ~/document-0003.djvu -e 'select 1; set-txt tesseract-0003-djvu.txt' -s
I also saw that ocroscript already creates an almost usable HTML output with
page and line pixel information, but without word boxes:
ocr_page; bbox 0 0 4960 7014>
ocr_line; bbox 470 921 3318 1038>something with more text
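For reference, the line boxes could be pulled out of such HTML with a few lines of Python. This is only a sketch under assumptions: it expects hOCR-style markup in which each line is an element with class "ocr_line" and a title attribute containing "bbox x0 y0 x1 y1", and the class and function names below are hypothetical.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxes(HTMLParser):
    """Collect [bbox, text] for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # [(x0, y0, x1, y1), text] entries
        self.tag = None    # tag name of the line element being read
        self.nested = 0    # same-named tags opened inside that element

    def handle_starttag(self, tag, attrs):
        if self.tag is not None:
            if tag == self.tag:
                self.nested += 1
            return
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                box = tuple(int(n) for n in match.groups())
                self.lines.append([box, ""])
                self.tag = tag
                self.nested = 0

    def handle_endtag(self, tag):
        if tag == self.tag:
            if self.nested:
                self.nested -= 1
            else:
                self.tag = None  # the line element is closed

    def handle_data(self, data):
        if self.tag is not None:
            self.lines[-1][1] += data


def line_boxes(hocr):
    """Return a list of ((x0, y0, x1, y1), text) tuples, one per line."""
    parser = LineBoxes()
    parser.feed(hocr)
    parser.close()
    return [(box, text.strip()) for box, text in parser.lines]
```

The boxes returned here are still in image coordinates (origin at the top-left), so they would need their y values flipped before being written as DjVu hidden text.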
So to summarize, I would really like to see a new option like this:
tesseract ~/image-0003.tif ~/tesseract-0003-djvu --lang nld --format djvu-text
(Of course, anything else with the same result would be great too.)
I attached my source DjVu document so you can reproduce everything I did.
I hope you can find the resources to pull this off. If limited
sponsoring would help, please contact me and I will see what I can arrange.
What are your thoughts on this? Is it doable, and in what time
frame?
Many thanks in advance,
Best regards,
Jelle de Jong
[1] http://en.wikipedia.org/wiki/DjVu
[2] https://secure.powercraft.nl/svn/packages/trunk/source/pct-scanner-scripts/
[3] http://packages.debian.org/sid/pct-scanner-scripts
[4] http://any2djvu.djvuzone.org/
[5] http://en.wikipedia.org/wiki/FLOSS
[6] man djvused | "Hidden text syntax"
HEAD of gscan2pdf
(http://gscan2pdf.git.sourceforge.net/git/gitweb.cgi?p=gscan2pdf) does
this already.
Regards
Jeff
The code I have used in gscan2pdf is
ocroscript $SETTING{ocroscript} --tesslanguage=$SETTING{'ocr language'} $png > $txt.txt
where
$SETTING{ocroscript} is either 'recognize' or 'rec-tess'
and
$SETTING{'ocr language'} is the language code
gscan2pdf writes the DjVu hidden text automatically.
Regards
Jeff
gscan2pdf is a GUI that automates scanning, image clean-up, OCR and
saving, e.g. as DjVu or PDF, with the OCR embedded in the file.
tesseract output is simply embedded free-form. ocropus output is
placed at the positions reported by hOCR.
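As a rough illustration of what free-form embedding can look like (this is not the gscan2pdf code, which is Perl), the whole recognised text can be stored in a single page-sized zone without any line or word boxes. The function name is hypothetical, and the C-style escaping of backslashes, double quotes and newlines is an assumption to be checked against "Hidden text syntax" in man djvused.

```python
def freeform_page(width, height, text):
    """Wrap plain OCR text in one page-sized zone without word positions."""
    escaped = (text.replace("\\", "\\\\")   # backslashes first
                   .replace('"', '\\"')     # then double quotes
                   .replace("\n", "\\n"))   # newlines as a two-character escape
    return '(page 0 0 %d %d "%s")' % (width, height, escaped)


if __name__ == "__main__":
    print(freeform_page(4960, 7014, "Een school\nis geen kantoor"))
```

Text stored this way is searchable, but a viewer can only highlight the whole page rather than individual words.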
The conversion from hOCR to DjVu hidden text format seemed so trivial
to me that I didn't think an extra tool would be generally
useful. If you can read Perl, look at the gscan2pdf source.
Regards
Jeff
A quick Google search shows:
http://code.google.com/p/tesseract-ocr/issues/list
And, BTW, this IS a mailing list.
Thanks for the link, my Google results were kind of different ;-)
I made a feature request:
http://code.google.com/p/tesseract-ocr/issues/detail?id=221
I am more used to Mailman-like systems, but I am trying :)
Best regards,
Jelle