Re-extract text (documents)


Juergen

Jan 2, 2014, 8:01:46 AM
to resour...@googlegroups.com
Due to a server problem, we have several thousand PDF documents uploaded for which field id=72 (Extracted text) is empty. This happened while uploads from different clients coincided with a running staticsync over several thousand images and documents. (Older PDFs are fine.)

I am not sure how to handle this with reindex.php. Is there a way to re-extract the text from the PDFs only?

I found this, but I am not sure it is the right approach: https://groups.google.com/forum/#!msg/resourcespace/cPTGszqFwZo/R14NI-ccVQgJ

Thanks for any advice.

Juergen

Allison Stec

Jan 2, 2014, 8:32:58 AM
to ResourceSpace
You're going to want to use update_exiftool_field.php to re-extract the text before using a reindexing script.
You can pass the field you want to update by adding "&fieldrefs=72" to the end of the URL.
You can also focus the script on a particular collection with "&col=<collectionID#>"

With thousands of documents on your list, you may want to try extracting from a few resources first to verify that pdf text extraction is working properly.
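
For anyone scripting this, here is a minimal Python sketch of how the two query parameters mentioned above ("fieldrefs" and "col") could be combined into one URL. The host name and the helper name build_tool_url are placeholders for illustration, not part of ResourceSpace; note also that the script only acts on fields that have an exiftool mapping.

```python
from typing import Optional
from urllib.parse import urlencode


def build_tool_url(base_url: str, fieldrefs: int, col: Optional[int] = None) -> str:
    """Build the tool URL with the field to update and an optional collection.

    `base_url` is wherever update_exiftool_field.php lives on your own
    installation (assumed here, not taken from any official documentation).
    """
    params = {"fieldrefs": fieldrefs}
    if col is not None:
        # Restrict the run to a single collection, e.g. a small test set.
        params["col"] = col
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host; substitute your own ResourceSpace base URL.
    print(build_tool_url("http://rs.example/pages/tools/update_exiftool_field.php", 72, col=44))
```

Pointing "col" at a small test collection first is one way to follow the advice about verifying a few resources before running against thousands.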

Allison Stec
Asset Management Specialist
Colorhythm
http://www.colorhythm.com

Main Office: +1 415-399-9921
Fax: +1 253-399-9928

as...@colorhythm.com


--
ResourceSpace: Open Source Digital Asset Management
http://www.resourcespace.org

Juergen

Jan 2, 2014, 12:00:23 PM
to resour...@googlegroups.com
Allison,

Thanks for the hint. When I try this with collection ID 44:

http://myrs.net/pages/tools/update_exiftool_field.php?fieldrefs=72&col=44

I get this output in the browser: "Please add an exiftool mapping to your Extracted text Field"

and no text is extracted. The collection contains only one PDF.

Kind regards,

Juergen

Allison Stec

Jan 2, 2014, 12:26:14 PM
to ResourceSpace
My mistake. With exiftool being so robust, I often forget that it's not responsible for text extraction.

I don't think there's anything within RS that will do this for you. You may need to consider creating a small plugin that utilizes the function "extract_text".
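
As a rough illustration of the batch idea (not the plugin itself), the sketch below re-extracts text from a list of PDF files outside ResourceSpace. It assumes the external pdftotext command-line tool is installed and on PATH; the names pdftotext_extract and reextract are made up for this example, and nothing here calls RS's own "extract_text" or writes back to field 72. That last step would still need RS code.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def pdftotext_extract(pdf_path: Path) -> str:
    """Run the external `pdftotext` tool and return the text it prints.

    Assumption: `pdftotext` is on PATH. The trailing "-" asks it to
    write to stdout instead of creating a .txt file next to the PDF.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def reextract(
    pdf_paths: Iterable[Path],
    extractor: Callable[[Path], str] = pdftotext_extract,
    limit: Optional[int] = None,
) -> Dict[Path, str]:
    """Extract text from PDFs only; `limit` caps the run for a trial batch."""
    extracted: Dict[Path, str] = {}
    for path in pdf_paths:
        if limit is not None and len(extracted) >= limit:
            break
        if path.suffix.lower() != ".pdf":
            continue  # ignore images and other non-PDF files
        try:
            extracted[path] = extractor(path).strip()
        except (OSError, subprocess.CalledProcessError) as exc:
            # A missing tool or a damaged PDF should not stop the whole batch.
            print(f"skipped {path}: {exc}")
    return extracted
```

A trial run over a handful of files could look like `reextract(Path("/path/to/files").rglob("*.pdf"), limit=5)`, where the directory is a placeholder. Checking those few results by hand before processing thousands of documents keeps a broken extractor from going unnoticed.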

Allison Stec

Juergen

Jan 2, 2014, 12:44:36 PM
to resour...@googlegroups.com
Allison,

No worries, and thank you for your time all the same.
I was actually a bit confused by the mention of exiftool; I thought I might have missed a connection.
Has this problem never come up before? After all, it affects basic RS functionality.

Any idea is appreciated.

Kind regards,

Juergen
