Full-text indexing for Office documents?


Pekka Gaiser

Aug 12, 2011, 8:46:11 AM
to resour...@googlegroups.com
Hi all,
I installed and activated the experimental unoconv-based OpenOffice
integration as described here:
http://wiki.resourcespace.org/index.php/OpenOffice_Integration
ResourceSpace now creates thumbnail previews of Word documents, which is
great. However, no content seems to be loaded into the full-text index:
the field that should show the text extracted from the document remains
empty.

Does anybody know why?
Does the Unoconv integration enable thumbnails only?

Thanks,
Pekka

safmon

Aug 15, 2011, 7:46:52 AM
to ResourceSpace
Hi Pekka,

What field do you use for the extracted text?
Have you set the following (in config.php):

# When extracting text from documents (e.g. HTML, DOC, TXT, PDF), which
# field is used for the actual content?
# Comment out the line to prevent extraction of text content
$extracted_text_field=72;

Pekka Gaiser

Aug 19, 2011, 12:51:58 PM
to resour...@googlegroups.com
Hi Safmon, thanks for your reply and sorry for the delay!

> What field do you use for the extracted text?
> Have you set the following (in config.php):

That line wasn't there, so I copied it over from the sample config file.
However, the behaviour seems unchanged: ODT, DOCX and PDF documents do
get extracted - I can see the extracted text in the memo field.
Only old-style .DOC documents do not get indexed.
Is this by design?
The server is running OpenOffice 3.x.

Tom Gleason

Aug 19, 2011, 1:41:43 PM
to resour...@googlegroups.com
Please try SVN revision 2881. I've modified the code so that, when unoconv is in use, text is extracted from the generated PDF.
http://svn.montala.net/websvn/revision.php?repname=ResourceSpace&path=%2F&rev=2881&peg=2881
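
For readers who want to see the general idea of extracting text by way of a generated PDF, here is a minimal sketch. It is not the code from revision 2881. It assumes the `unoconv` and `pdftotext` command-line tools are installed and that unoconv writes the PDF next to the source file, which is its usual default. The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def unoconv_command(source: Path) -> List[str]:
    # Hypothetical helper: ask unoconv to convert the document to PDF.
    # Assumption: the PDF lands next to the source with a .pdf extension.
    return ["unoconv", "-f", "pdf", str(source)]


def pdftotext_command(pdf: Path) -> List[str]:
    # "-" as the output file makes pdftotext write the text to stdout.
    return ["pdftotext", "-enc", "UTF-8", str(pdf), "-"]


def extract_text_via_pdf(source: Path) -> str:
    # Step 1: office document -> PDF (raises CalledProcessError on failure).
    subprocess.run(unoconv_command(source), check=True)
    pdf = source.with_suffix(".pdf")
    # Step 2: PDF -> plain text, captured from stdout.
    result = subprocess.run(
        pdftotext_command(pdf),
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return result.stdout
```

Calling `extract_text_via_pdf(Path("report.doc"))` would return the document text as a string, which could then be stored in whatever field `$extracted_text_field` points to.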

--
You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespace+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.




--
Tom Gleason, PHP Developer

ResourceSpace Support Services
https://www.buildadam.com

Pekka Gaiser

Aug 19, 2011, 3:39:17 PM
to resour...@googlegroups.com
Ohhhh man!
Turns out the files I was importing weren't Word documents at all, but plain text files.
Word Viewer would happily open and display them, which is why I never got suspicious - until just now.
It works with proper .doc files now.
Terribly sorry to have wasted your time and thanks anyway for your ingenious patch idea!
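
A quick way to catch this kind of mix-up is to look at the first bytes of a file rather than its extension. The sketch below is an illustration only, not part of ResourceSpace. It checks for the well-known signatures of the OLE2 container used by legacy .doc files, the ZIP container used by DOCX and ODT, and PDF. The function names and return labels are made up for this example.

```python
# Signature of the OLE2 compound file format used by legacy .doc files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# DOCX and ODT are ZIP archives, so they start with the ZIP local-file header.
ZIP_MAGIC = b"PK\x03\x04"
PDF_MAGIC = b"%PDF-"


def sniff_document_type(header: bytes) -> str:
    # Hypothetical helper: classify a file by its leading bytes.
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    # Plain text renamed to .doc ends up here.
    return "unknown"


def sniff_file(path: str) -> str:
    # Only the first 8 bytes are needed for the signatures above.
    with open(path, "rb") as handle:
        return sniff_document_type(handle.read(8))
```

A text file saved with a .doc extension comes back as "unknown", which would have flagged the problem before any indexing was attempted.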

-- 
--------------------------------------
Taunusstr. 55
51105 Köln
0221-82 82 47 44