automatic text recognition (ocr)

nickthephreak

May 27, 2009, 9:37:10 AM
to ResourceSpace
Hi there,

As I am currently in the process of importing all my images into RS
(which, by the way, is THE DAM system I was looking for, for many years!), I
was wondering if it would be possible to add text recognition (OCR) to
RS?
For me this would be an extremely helpful addition to its already
implemented indexing options.

I did a quick search and found Google's open source system "ocropus",
which they use for Google Books.

I don't have programming knowledge, so I don't have any idea if this
would work.

If it could work, would you please take this post as a "feature
request"?

Thanks,
Nick

Dan Huby

May 27, 2009, 9:53:31 AM
to ResourceSpace
If you can find a suitable open source OCR library or command line
tool it shouldn't be too difficult to integrate. There is a function
extract_text() in image_processing.php that is a good place to add
this. It would be a matter of detecting appropriate bitmap types by
their extension and forwarding them to the OCR command line tool, just
like the other blocks in that function for Word documents (etc.).
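
As a rough illustration of that idea (detect a bitmap by its extension, then hand it to a command line OCR tool), here is a minimal Python sketch. It is not ResourceSpace code: the function names and the extension list are made up for the example, and it assumes the tesseract binary is on the PATH and is called as "tesseract <image> <outputbase>", writing its text to <outputbase>.txt.

```python
import os
import shutil
import subprocess

# Hypothetical list of bitmap extensions worth sending to OCR.
OCR_EXTENSIONS = {"tif", "tiff", "png", "jpg", "jpeg", "bmp", "gif"}


def is_ocr_candidate(path):
    """Return True if the file extension looks like a bitmap image."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    return ext in OCR_EXTENSIONS


def build_ocr_command(image_path, output_base):
    """Build the tesseract call; tesseract writes output_base + '.txt'."""
    return ["tesseract", image_path, output_base]


def ocr_extract_text(image_path, output_base):
    """Return the recognised text, or None if OCR does not apply here."""
    if not is_ocr_candidate(image_path):
        return None  # not a bitmap, leave it to the other extractors
    if shutil.which("tesseract") is None:
        return None  # OCR tool not installed on this machine
    subprocess.run(build_ocr_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

The extension check comes first so that non-image files never reach the OCR tool, which mirrors how a per-file-type block would be guarded.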

OCR projects I would look at are:
http://en.wikipedia.org/wiki/GOCR
http://en.wikipedia.org/wiki/Tesseract_(software)
http://en.wikipedia.org/wiki/Ocrad

I'm not sure about Ocropus... I limited my search to Ubuntu packages.

I hope this helps.

You are welcome to request features but it's very unlikely someone
will come along and develop it for free. But you never know! :)

If you fund development or develop this yourself it would be good to
have it in the base, if you could supply a patch.

Thanks,

Dan

Tom Gleason

May 27, 2009, 3:58:44 PM
to resour...@googlegroups.com
I played with tesseract a bit a few months ago and found it quite
difficult to get results.
I'm sure there is something that could accomplish OCR for
ResourceSpace, but it's much more complicated than finding a good
command to run the program or creating an intermediate TIFF file
(which is also necessary, at least for tesseract, afaik).
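
For reference, the intermediate TIFF step on its own might look something like the Python sketch below. It only builds the command; the helper names are invented for the example, and it assumes ImageMagick's convert is installed and that an uncompressed TIFF ("-compress None") is what the OCR tool wants.

```python
import os


def tiff_path_for(image_path, work_dir):
    """Name the intermediate TIFF after the source file, inside work_dir."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(work_dir, stem + ".tif")


def build_tiff_command(image_path, work_dir):
    """Build an ImageMagick call that writes an uncompressed TIFF copy."""
    return [
        "convert",
        image_path,
        "-compress", "None",  # assumption: the OCR tool needs uncompressed TIFF
        tiff_path_for(image_path, work_dir),
    ]
```

The resulting list can be passed to subprocess.run(), and the TIFF deleted once the OCR tool has read it.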
--
Tom

Mathias Hunskår Furevik

May 27, 2009, 4:18:26 PM
to resour...@googlegroups.com
It's possible to use ocropus, which is basically a wrapper for
tesseract-ocr and handles deskewing, multiple columns etc.

It's been a while since I used it, but if I remember correctly the biggest
drawbacks were:

--> Slow: it took one week to process 20000 pages in PNG format
--> Results are a bit unpredictable
--> I've only managed to build it under Ubuntu. Tried both Solaris and
Debian; I believe it requires GCC 4.1>.


//mathias
