Binaries indexing


Gary Carmichael

Jul 3, 2018, 7:06:46 AM
to Hippo Community
Hi,

Recently we've started to run into memory issues when indexing takes place. What stood out to us was that PDF indexing was taking a large amount of time, so we began looking at ways to exclude PDFs.

I found this old community post:

This recommended altering the Jackrabbit 'textFilterClasses' entry in repository.xml, but this seems to have been deprecated in favour of using Apache Tika.



Setting this 'hippo:text' property to an empty string seems to be working (we don't wish to extract the PDF text at all - we have HTML versions that are indexed instead).

Is this your recommended course of action? Would the same be appropriate for other binaries too (e.g. Word docs, Excel sheets)?

Ard Schrijvers

Jul 3, 2018, 9:07:36 AM
to hippo-c...@googlegroups.com
Hey Gary,

Setting a 'hippo:text' binary property that is empty indeed does the trick. It must be an empty binary, not an empty string.

Regards Ard
> --
> Hippo Community Group: The place for all discussions and announcements about
> Hippo CMS (and HST, repository etc. etc.)
>
> To post to this group, send email to hippo-c...@googlegroups.com
> RSS:
> https://groups.google.com/group/hippo-community/feed/rss_v2_0_msgs.xml?num=50
> ---
> You received this message because you are subscribed to the Google Groups
> "Hippo Community" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to hippo-communi...@googlegroups.com.
> Visit this group at https://groups.google.com/group/hippo-community.
> For more options, visit https://groups.google.com/d/optout.




Woonsan Ko

Jul 3, 2018, 9:18:14 AM
to hippo-c...@googlegroups.com
On Tue, Jul 3, 2018 at 7:06 AM, Gary Carmichael <g.carmi...@gmail.com> wrote:
> Hi,
>
> Recently we've started to run into memory issues when indexing takes place. What stood out to us was that PDF indexing was taking a large amount of time, so we began looking at ways to exclude PDFs.
>
> I found this old community post:
>
> This recommended altering the Jackrabbit 'textFilterClasses' entry in repository.xml, but this seems to have been deprecated in favour of using Apache Tika.

Indeed. It was changed to a no-op as of https://issues.apache.org/jira/browse/JCR-2885, and finally removed with https://issues.apache.org/jira/browse/JCR-4236.
 



> Setting this 'hippo:text' property to an empty string seems to be working (we don't wish to extract the PDF text at all - we have HTML versions that are indexed instead).

If no index is needed for PDFs, that is the best approach. The presence of hippo:text makes the repository think the binary is already full-text indexed through that special property.
 

> Is this your recommended course of action? Would the same be appropriate for other binaries too (e.g. Word docs, Excel sheets)?

Yes.
In addition, whether a binary needs to be parsed at all is determined by the Tika parser configuration (hippo-repository-tika-x.x.x.jar!org/onehippo/repository/tika/tika-config.xml), which defines the parsers for PDF, Office docs, etc., as well as EmptyParser (no parsing) for zip, images, etc.
The hippo:text trick will be the simplest in your case.
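
For illustration, the Tika route mentioned above could look roughly like the fragment below, which hands PDFs to EmptyParser so that no text is extracted. This is only a sketch in the format documented for recent Apache Tika 1.x releases. The tika-config.xml bundled in hippo-repository-tika may use a different layout or an older Tika version, and how to override the bundled file is not covered here, so check the file in your own jar first.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the Tika 1.x config format with mime-exclude support. -->
<properties>
  <parsers>
    <!-- Keep the default parsers for everything except PDF. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Route PDF to EmptyParser, which extracts no text. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
```

The same pattern would apply to Word or Excel MIME types. It changes parsing for every binary of that type, whereas the empty hippo:text binary only affects the resources you set it on.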

Regards,

Woonsan


Gary Carmichael

Jul 3, 2018, 11:26:38 AM
to Hippo Community
Thanks for your replies, Woonsan and Ard.