OCR-searchable PDF content in DSpace 7+

ruchira raul

Jan 19, 2023, 1:51:29 AM
to DSpace Technical Support
I have installed DSpace 7.3 on Ubuntu 22.04.
Kindly help me make PDFs and other files OCR-searchable, i.e. searchable by any content inside the PDF or file and not only by keywords, title, author, etc.

Please let me know which settings to change in the configuration files, or which steps to follow.

Tim Donohue

Jan 23, 2023, 10:55:27 AM
to DSpace Technical Support
Hi,

For full-text indexing/searching in DSpace, you need to enable/run the "Media filters": https://wiki.lyrasis.org/display/DSDOC7x/Mediafilters+for+Transforming+DSpace+Content

These are scripts that can extract text from text-based content (like OCR'd PDFs, Word docs, etc.).

Most sites choose to run those on a scheduled basis (e.g. once per day, or a few times a day) via a cron job. See this guide: https://wiki.lyrasis.org/display/DSDOC7x/Scheduled+Tasks+via+Cron
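
As a rough sketch of what this can look like in practice: the media filters are started with the `filter-media` command of the DSpace command-line launcher. The example below assumes DSpace is installed under /dspace (adjust `DSPACE_DIR` for your install); the helper name `run_media_filters` and the 03:00 schedule are only illustrative, so check the two guides linked above for the options that fit your site.

```shell
#!/bin/sh
# Assumption: DSpace is installed under /dspace (the "[dspace]" directory in
# the documentation); override DSPACE_DIR if your install lives elsewhere.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"

# Hypothetical helper: run the media filters once through the DSpace launcher.
# By default filter-media skips files it has already processed.
run_media_filters() {
    "$DSPACE_DIR/bin/dspace" filter-media
}

# To run it by hand, call the function as the user that owns the installation:
#   run_media_filters
#
# To schedule it instead, a crontab entry for that user could look like this
# (03:00 every day; the time is only an example):
#   0 3 * * * /dspace/bin/dspace filter-media
```

Since the filters extract text from text-based content, a scanned PDF would presumably need an OCR text layer before its content can be found this way.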

If you have more questions, let us know on this list.

Tim
