QUESTION

Oscar Orrego

Oct 4, 2024, 10:59:34 AM

to Comunidad DSpace, dspace-a...@listas.mincyt.gob.ar, dspace-commun...@googlegroups.com

Hello everyone,

We have DSpace 9 installed on a data server and want to bring it up to digitize the library of the institution where I work. In the basic tests we ran, we can search by metadata but not by the CONTENT of the documents. Our users will need to search for words, exclude others, and so on.

Is there a configuration that makes DSpace index the content of each uploaded PDF document (with OCR applied) for full-text search?

Thank you very much

Oscar Orrego

Job Diogenes Ribeiro Borges

Oct 12, 2024, 10:29:22 AM

to DSpace Community

Hello Oscar,

I don't know whether there are specific DSpace settings for this. But since DSpace uses Apache Solr for indexing, it should be achievable.

Search Google for "Solr OCR PDF indexing"

https://opensemanticsearch.org/doc/admin/config/ocr/

Cheers

Oscar Orrego

Oct 14, 2024, 9:56:36 AM

to Job Diogenes Ribeiro Borges, DSpace Community

Hello Diogenes:
Thank you very much for answering. My files are already uploaded as PDFs with OCR applied. What I need is to be able to search for words inside the uploaded OCR files (items). For example, if a file contains the name "JUAN", a search should find it in the content of the uploaded OCR file, not only in the previously entered metadata.
Thank you so much

Oscar

--
All messages to this mailing list should adhere to the Code of Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
---
You received this message because you are subscribed to the Google Groups "DSpace Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-communi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-community/b026517e-f77c-4386-a138-799328f08b29n%40googlegroups.com.

Abel Gómez

Oct 14, 2024, 10:13:58 AM

to dspace-c...@googlegroups.com

Hi Oscar,

If I'm not wrong, full-text search on PDFs should be enabled by default if you have configured your DSpace instance to run the media filters regularly (see https://wiki.lyrasis.org/display/DSDOC8x/Scheduled+Tasks+via+Cron; it is referenced in step 15 of the backend Installation guide):

https://wiki.lyrasis.org/display/DSDOC8x/Mediafilters+for+Transforming+DSpace+Content

The documentation says explicitly that OCRed documents should work using the "PDF Text Extractor".

Cheers,

Abel

-- 
Abel Gómez Llana, PhD

ab...@gomez.llana.me
https://abel.gomez.llana.me
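
For readers landing on this thread later, the advice above (run the media filters on a schedule, then search for a word that only appears inside a PDF) could be sketched roughly as follows. This is only an illustration, not an official recipe: the install directory `/dspace`, the backend URL `http://localhost:8080/server`, the 03:00 schedule, and the helper names are all assumptions to adapt to your own setup. The `filter-media` launcher command and the `discover/search/objects` REST endpoint are the pieces DSpace itself provides.

```python
import shlex
from urllib.parse import urlencode

# Assumption: DSpace is installed under /dspace; use your own [dspace] directory.
DSPACE_DIR = "/dspace"
# Assumption: hypothetical backend address; replace with your server's REST URL.
REST_BASE = "http://localhost:8080/server"


def filter_media_command(dspace_dir=DSPACE_DIR):
    """Command line that runs the media filters, which include PDF text extraction."""
    return [f"{dspace_dir}/bin/dspace", "filter-media"]


def cron_line(dspace_dir=DSPACE_DIR, hour=3):
    """Crontab entry that runs the media filters once a day at the given hour."""
    return f"0 {hour} * * * " + shlex.join(filter_media_command(dspace_dir))


def search_url(query, rest_base=REST_BASE):
    """Discovery search URL for checking that a word inside a PDF is findable."""
    return f"{rest_base}/api/discover/search/objects?" + urlencode({"query": query})


if __name__ == "__main__":
    # Line to add to the crontab of the user that owns the DSpace installation.
    print(cron_line())
    # URL to open (or fetch) after the filters have run and indexing has caught up.
    print(search_url("JUAN"))
```

If the search for a word that exists only in the PDF body still returns nothing after the filters have run, checking whether the item has an extracted-text bitstream is a reasonable next step before looking at the Solr side.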
