Use one lucene index per cluster


Thorben Heins

Jun 2, 2016, 9:13:01 AM
to Hippo Community
Hi fellow hippostians,

we were wondering if any of you have ever played around with a shared Jackrabbit Lucene index for all cluster nodes. The Jackrabbit documentation advises against it, yet it might still work if Lucene were used as a service instance rather than as a filesystem reference.

Background: We are tuning our auto-scaling cluster on AWS and are trying to reduce the boot-up time of new nodes.

Another approach we are evaluating: to keep the delta to be indexed on startup as small as possible, we are considering working with an EC2 snapshot taken after the first boot of the application, which initially indexes all the JCR nodes. A new release after each sprint would then be a baked AMI including an initialized Lucene index. Whenever we need a new instance, we can use that snapshot state for a decreased startup time.

Regards,
Thorben

Bart van der Schans

Jun 2, 2016, 2:24:44 PM
to hippo-c...@googlegroups.com
Hi Thorben,

You can't share the lucene index between cluster nodes. Jackrabbit,
which is what our repository is built upon, does not support it and it
will break. Each cluster node needs full control over its own index.
This is part of the core of Jackrabbit and not something that can be
changed.

Regards,
Bart



--
Hippo B.V. - Oosteinde 11, 1017 WT Amsterdam
Hippo USA, Inc. - 71 Summer Street, 2nd Floor, Boston, MA 02110

US +1 877 414 47 76 (toll free)
NL +31 20 522 44 66
UK +44 20 35 14 99 60
DE +49 69 80 88 40 67

http://www.onehippo.com
http://www.onehippo.org

Ard Schrijvers

Jun 2, 2016, 3:48:29 PM
to hippo-c...@googlegroups.com
Hey Thorben,

First of all, I completely understand your desires/requirements and
also your suggested approaches. See some extra comments below, on top
of the ones from Bart.

On Thu, Jun 2, 2016 at 8:24 PM, Bart van der Schans
<b.vande...@onehippo.com> wrote:
> Hi Thorben,
>
> You can't share the lucene index between cluster nodes. Jackrabbit,
> which is what our repository is built upon, does not support it and it
> will break. Each cluster node needs full control over its own index.
> This is part of the core of Jackrabbit and not something that can be
> changed.

Indeed, it is correct that Jackrabbit does not support it. Even if
Jackrabbit did support it, we'd most likely end up with some cluster
nodes having to access a remote filesystem (assuming not all cluster
nodes run on the same machine), and that doesn't work out well with
Lucene, see [1]: remote index searches are in general too slow.

Wrt your second approach, there is currently a limitation in
Jackrabbit that makes it not possible at this moment: Jackrabbit keeps
the latest Lucene index changes in memory (next to a redo log on the
filesystem) for some time (say a couple of seconds) and flushes them
to disk after some time has passed or after a certain number of newly
indexed documents. As a result, you cannot rely on the filesystem
index alone: a snapshot of the filesystem might miss the part that is
still kept in memory. So, for now, your second approach does not (yet)
work either. I write (yet) because some time ago I was experimenting
with support for taking correct Lucene index snapshots from a running
repository. The results seemed quite OK, but it was still in PoC
phase. For our on-demand offering we want the same thing you are
after: quick recovery / rollback / horizontal scaling when required.

Unfortunately for you, at this moment it is thus not yet possible.
The only way that is currently supported is as follows: if the load
balancer hits, say, two cluster nodes, you can keep a third, slave
node up. When it gets busier, you shut down the third node (then the
in-memory index is flushed to disk as well), copy its instance to a
fourth cluster node, and start up both the third and the fourth node.
Add the third to the load balancer and keep the fourth running as a
slave (the slave receives all changes and keeps its index up to date
all the time).
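
A sketch of this rotation, assuming systemd-managed Tomcats; all host names, paths and service names are hypothetical, and the `run` wrapper only exists so the sequence can be dry-run:

```shell
#!/bin/sh
# Poor man's standby rotation as described above. Hosts, paths and
# service names are made up; set DRY_RUN=1 to only print the commands.
run() {
  if [ -n "$DRY_RUN" ]; then echo "would run: $*"; else "$@"; fi
}

rotate_standby() {
  # Stopping the standby flushes its in-memory Lucene index to disk.
  run ssh node3 systemctl stop tomcat
  # Copy the now-complete index (and repository state) to the new node.
  run ssh node3 rsync -a /opt/hippo/repository/ node4:/opt/hippo/repository/
  # node3 goes behind the load balancer, node4 becomes the new standby.
  run ssh node3 systemctl start tomcat
  run ssh node4 systemctl start tomcat
}
```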

I understand this is a really poor man's horizontal scaling approach.
That is why I had been working on the PoC to be able to take a Lucene
index snapshot. Once we have that productized, the second suggestion
you had is the way to go.

HTH,

Regards Ard

[1] https://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Hippo Netherlands, Oosteinde 11, 1017 WT Amsterdam, Netherlands
Hippo USA, Inc. 71 Summer Street, 2nd Floor Boston, MA 02110, United
States of America.

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Thorben Heins

Jun 3, 2016, 4:20:19 AM
to Hippo Community
Hi Bart, 
Hi Ard,

thanks for your super-fast replies and the detailed information! We are actually going to try this:

1. initial boot of application
2. when /ping returns 200 -> shutdown of tomcat/application
3. create a snapshot of the state of the machine
4. boot cluster nodes with that snapshot state as needed with "minimal" delta on lucene's side
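
In shell, the steps above could look something like this; the ping URL, service name and AMI naming are placeholders for whatever our actual tooling uses:

```shell
#!/bin/sh
# Step 2 helper: run CHECK (a command that prints an HTTP status code)
# until it prints 200, at most TRIES times with DELAY seconds in between.
wait_until_200() {
  check="$1"; tries="${2:-60}"; delay="${3:-5}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$(eval "$check")" = "200" ] && return 0
    i=$((i + 1)); sleep "$delay"
  done
  return 1
}

# Steps 1-4 on the instance being baked (all names are placeholders):
#   wait_until_200 "curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/ping" \
#     && systemctl stop tomcat \
#     && aws ec2 create-image --instance-id "$INSTANCE_ID" \
#          --name "hippo-baked-$(date +%Y%m%d%H%M)"
```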

This of course strongly depends on the use case. As it is right now, we don't have that many content changes between our deployments every two weeks, so the delta is rather small. If there were big changes all the time, the snapshot could be taken regularly, I guess. We'll keep that in mind if we face that issue.

If there are any updates on the forced Lucene index persistence approach, we would probably rather use that, as it would of course minimize the boot-up of new instances.

Just thinking of it: you could regularly (every 6 hrs?) boot a non-online instance, create a new snapshot and use that for future scaling. This way the instance would not have to run all the time. Could surely save a couple of $$$ ;-)
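
The scheduling side of such a rebake would be just a cron entry plus a staleness check; the 6-hour threshold and the function are of course made up for illustration:

```shell
#!/bin/sh
# Succeed when the snapshot (created at CREATED_EPOCH) is older than
# MAX_AGE seconds (default 6 h = 21600 s), i.e. when a rebake is due.
snapshot_is_stale() {
  created="$1"; now="$2"; max_age="${3:-21600}"
  [ $((now - created)) -gt "$max_age" ]
}

# Hypothetical cron entry driving it:
#   0 */6 * * * /opt/scripts/rebake-if-stale.sh
```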

btw: everything a node needs to become its own cluster node is injected via environment variables during boot.
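
For the curious, that boils down to mapping the injected variable onto Jackrabbit's cluster node id system property in the Tomcat startup environment; the variable name and the hostname fallback are our own convention:

```shell
#!/bin/sh
# Map a boot-time environment variable onto the Jackrabbit cluster node
# id. Jackrabbit falls back to the system property
# org.apache.jackrabbit.core.cluster.node_id when repository.xml does
# not hard-code an id attribute on the Cluster element.
CLUSTER_NODE_ID="${CLUSTER_NODE_ID:-$(hostname)}"   # injected via env, hostname as fallback
CATALINA_OPTS="$CATALINA_OPTS -Dorg.apache.jackrabbit.core.cluster.node_id=$CLUSTER_NODE_ID"
export CATALINA_OPTS
```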

Greetings from Hamburg,
Thorben

Ard Schrijvers

Jun 3, 2016, 4:27:01 AM
to hippo-c...@googlegroups.com
Hey Thorben,

On Fri, Jun 3, 2016 at 10:20 AM, Thorben Heins <ma...@thorben-heins.de> wrote:
> Hi Bart,
> Hi Ard,
>
> thanks for your super fast replies and detailed information! We are actually
> going to try this:
>
> 1. initial boot of application
> 2. when /ping returns 200 -> shutdown of tomcat/application
> 3. create a snapshot of the state of the machine
> 4. boot cluster nodes with that snapshot state as needed with "minimal"
> delta on lucene's side
>
> This of course strongly depends on the use case. As it is right now for us,
> we don't have that many changes in the content between our deployments every
> two weeks. So the delta is rather small. If there were big changes all the
> time, the snapshot could be performed regularly i guess. We'll keep that in
> mind if we are facing that issue.
>
> If there are any updates on the lucene forced persistence approach we would
> probably rather use that as it would of course minimize bootup of new
> instances.
>
> Just thinking of it: You could regularly (every 6hrs?) let a non online
> instance boot, create a new snapshot and use that for future scaling. This
> way the instance would not have to run all the time. Could surely save a
> couple of $$$ ;-)

That would indeed work out as well for your use case. For a different
use case, where we want close to instant horizontal scaling when it
gets busy, you'd need a different approach.

Regards Ard

Bartosz Oudekerk

Jun 3, 2016, 7:22:22 AM
to Hippo Community
On Fri, 3 Jun 2016 01:20:19 -0700 (PDT)
Thorben Heins <ma...@thorben-heins.de> wrote:

> Hi Bart,
> Hi Ard,
>
> thanks for your super fast replies and detailed information! We are
> actually going to try this:
>
> 1. initial boot of application
> 2. when /ping returns 200 -> shutdown of tomcat/application
> 3. create a snapshot of the state of the machine
> 4. boot cluster nodes with that snapshot state as needed with
> "minimal" delta on lucene's side

When implementing this, please have a look at the repository
maintenance and the lucene part of the backup/restore documentation as
well:

https://www.onehippo.org/library/enterprise/installation-and-configuration/backup-and-restore-strategy.html
https://www.onehippo.org/library/enterprise/installation-and-configuration/repository-maintenance.html

Kind regards,
Bartosz
--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 745 Atlantic Ave, Third Floor, Boston MA 02111

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
http://www.onehippo.com/