OCR Engine

Chandrika Hebbar

Aug 8, 2023, 12:03:27 PM

to DSpace Community

Hi Team,

I would like to know which OCR engine, if any, DSpace uses under the hood. Please share any documentation that covers it.

Regards,

Chandrika Hebbar

DSpace Community

Aug 16, 2023, 3:44:38 PM

to DSpace Community

Hi Chandrika,

DSpace does not have an OCR engine. It is only able to index PDFs (or other electronic files) if they have been previously OCR'ed by a different system.

Tim

Yvonne

Aug 16, 2023, 6:57:48 PM

to DSpace Community

Thank you both. I found this helpful to know!

Best regards,

Yvonne

--
All messages to this mailing list should adhere to the Code of Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
---
You received this message because you are subscribed to the Google Groups "DSpace Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-communi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-community/177ae4c4-a59f-4bdb-af87-0e2ce03e1582n%40googlegroups.com.

Mark H. Wood

Aug 17, 2023, 9:29:06 AM

to dspace-c...@googlegroups.com

On Wed, Aug 16, 2023 at 12:44:38PM -0700, DSpace Community wrote:
> DSpace does not have an OCR engine. It is only able to index PDFs (or
> other electronic files) if they have been previously OCR'ed by a different
> system.

Or if they contained machine-readable text to begin with.

So: a PDF that was rendered from a word-processing document (for
example) probably contains text that can be flattened and indexed. A
PDF which contains images of paper documents will not, unless the
imaging software or some other tool has OCRed the images and added a
text layer to the PDF.
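
To make that distinction concrete, here is a minimal Python sketch; it is not anything DSpace ships. It assumes you have already extracted text page by page with a tool of your choice, and it simply flags pages that came back (nearly) empty, which is a hint that they are scanned images without a text layer. The function name `pages_needing_ocr` and the `min_chars` threshold are made up for illustration.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is (almost) empty.

    page_texts is assumed to come from whatever text extractor you already
    use; this function does no PDF parsing itself.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters so output made of newlines
        # or form feeds does not pass as real text.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged
```

A page rendered from a word processor would normally clear the threshold; a scanned page with no text layer would normally be flagged. The threshold is a guess and would need tuning against real documents.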

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Hardy Pottinger

Aug 18, 2023, 11:54:23 AM

to Yvonne, DSpace Community

Hi, I thought I'd chime in here to say that everyone who has responded is correct: there is currently no OCR functionality within DSpace. However, DSpace does use Apache Tika to feed the full-text search index, and Tika supports OCR (via Tesseract OCR). To be clear, there's no OCR capability within DSpace... yet... but someone could build it, if they were keen to do so.

One word of caution to developers who want to tackle this job: I've seen Tesseract OCR severely impact another application's throughput... You'd have to engineer carefully to avoid running into the same problem.
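
One possible way to contain that cost, sketched here in Python rather than as a DSpace design, is to push OCR through a small, bounded worker pool so it cannot take over the machine, and to keep one bad file from stalling the rest. Everything named here is hypothetical: `ocr_with_bounded_workers` is an illustrative helper, and `ocr_fn` stands in for whatever actually performs OCR (assumed to hand the work to an external engine, so threads are enough).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def ocr_with_bounded_workers(
    paths: Iterable[str],
    ocr_fn: Callable[[str], str],
    max_workers: int = 2,
) -> Dict[str, Optional[str]]:
    """Run ocr_fn over paths with at most max_workers jobs in flight.

    Returns a mapping of path -> recognized text, or None if OCR failed.
    """
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool itself enforces the cap.
        futures = {path: pool.submit(ocr_fn, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception:
                # One unreadable file should not stall the rest of the queue.
                results[path] = None
    return results
```

Keeping `max_workers` well below the number of CPU cores leaves headroom for indexing and request handling; the right value would have to be measured on the actual system.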

--Hardy
