[morphia] feat.-req: fulltext search emulation


Uwe Schäfer

May 9, 2010, 2:35:31 PM
to mor...@googlegroups.com
hi

>> - preparation of fulltext-index (creating keyword list with stemming,
>> stop-word filtering)

i´ve seen some issue mentioning compass. while compass is a great
project and Shay is a very nice, helpful and clever guy, i wonder if
compass integration is the obvious choice.

yes, lucene is good when it does not have to scale, but if it has to -
oh boy. we're having roughly a million articles here that make a 1 GB
index which is
a) too slow for certain problems
b) impossible to fail over.
JDBC-Directories are incredibly slow and even the terracotta stuff did
not work for us.
i´ve recently seen hazelcast integration, but i´d stay away from the
hassle if possible.

the other problem is: if the index is not next to the database,
consistent backups are a matter of daydreams.

i looked into ft-indexes at the mongo website, and in first tests with
1 million articles (stemmed and cleared of stop words) a query
matching every article took 300-400ms on a single server instance.

this way fulltext searches can be integrated into filtering queries,
they would scale with the database, would be fail-over ready etc...
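
A minimal sketch of the keyword preparation described above (stop-word filtering plus stemming), independent of Morphia and the MongoDB driver. The `KeywordExtractor` class, its tiny stop-word list and the crude suffix-stripping `stem` method are hypothetical stand-ins; a real setup would plug in a proper language-specific stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: turns free text into a de-duplicated keyword list.
class KeywordExtractor {

    // Tiny illustrative stop-word list; a real one would be language specific.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in", "is", "to"));

    // Crude stand-in for a real stemmer: strips two common English suffixes.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lower-cases, splits on non-letters, drops stop words, stems, de-duplicates.
    static List<String> extract(String text) {
        Set<String> keywords = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            keywords.add(stem(token));
        }
        return new ArrayList<>(keywords);
    }
}
```

The resulting list could be stored in an array field next to the article and indexed there, so that a keyword lookup becomes an ordinary query condition that combines with other filters.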

sounds good to me. is there a reason not to go this way?

cu uwe

PS: remember i´m a mongo rookie.
PPS: could be easily added if there was a lifecycle listener ;)

Scott Hernandez

May 9, 2010, 4:46:05 PM
to mor...@googlegroups.com
Yep, this sounds less like a feature and more like an example you can
roll yourself. Making it a feature means adding dependencies on whatever
stemmer and parser libraries we would use.

Do you think a wiki-page would suffice?


Ólafur Gauti Guðmundsson

May 10, 2010, 6:19:46 AM
to Morphia
Hi,
I agree with you that I'm not sure the Compass integration is a good
fit. For my full text searching needs I use SOLR and I've also used
ElasticSearch (which is implemented by Shay Banon). These have the
stemmers, stopword filters, query ranking, etc. all supported out of
the box. Plus they scale (both are built on top of Lucene, I should
mention). Full text search is not one of Mongo's strengths in my view,
and if you need features like faceted search (which I need), "more
like this" queries, etc. then Mongo does not cut it at the moment
(they say so themselves: http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo).

But it is important to keep the index in sync, and I think there are
some initiatives in adding listeners to the Mongo commit logs, which
would allow changing documents to be automatically sent to the search
server.

I therefore think this is almost out of scope for the Morphia project
at the moment. But worth discussing.

Regards,
OGG

Uwe Schäfer

May 10, 2010, 11:49:46 AM
to mor...@googlegroups.com
Ólafur Gauti Guðmundsson wrote:

Hi Ólafur,

> For my full text searching needs I use SOLR and I've also used
> ElasticSearch (which is implemented by Shay Banon).

ummm. did not know ElasticSearch. looks slick, thanks for the tip.

> Full text search is not one of Mongo's strengths in my view,

Fully agreed. For my usecase, though, searching for stemmed keywords
would be enough.

> I therefore think this is almost out of scope for the Morphia project
> at the moment. But worth discussing.

i could put it on top, if morphia was open enough to do so. i´d need
access to the mapping information and the ability to change it, as well
as the EntityListener thing Scott started implementing.

i´ll ponder it a little more...

cu uwe
Scott Hernandez

May 10, 2010, 12:14:31 PM
to mor...@googlegroups.com
2010/5/10 Uwe Schäfer <u...@thomas-daily.de>:
> i could put it on top, if morphia was open enough to do so. i´d need
> access to the mapping information and the ability to change it, as well
> as the EntityListener thing Scott started implementing.

I don't see any reason you would have problems here, even now. Really,
the FTS stuff is logically outside your entity and stored as synthetic
properties that are derived from the actual entity data. I suspect
that you will only use those synthetic fields when issuing queries. It
would be easy enough to put this in a base class that has a
@PrePersist callback now. If you wanted to separate it into another class
then the @EntityListeners would help.
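
A rough sketch of the base-class idea, under stated assumptions: the `@PrePersist` annotation declared here is only a local stand-in so the example compiles without Morphia, a plain `Map` stands in for the driver's `DBObject`, and `SearchableEntity`, `Article` and the `_keywords` field name are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Local stand-in for Morphia's @PrePersist so the sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface PrePersist {}

// Hypothetical base class: derives a synthetic keyword field before saving.
abstract class SearchableEntity {

    // Subclasses return the text that should be searchable.
    protected abstract String searchableText();

    // A Map stands in for the driver's DBObject here.
    @PrePersist
    public void addKeywords(Map<String, Object> dbObj) {
        List<String> keywords = new ArrayList<>();
        for (String token : searchableText().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !keywords.contains(token)) {
                keywords.add(token); // a real version would also stem and filter
            }
        }
        // "_keywords" is an assumed name; the field exists only in the stored document.
        dbObj.put("_keywords", keywords);
    }
}

// Example entity: only the title is made searchable.
class Article extends SearchableEntity {
    private final String title;

    Article(String title) {
        this.title = title;
    }

    @Override
    protected String searchableText() {
        return title;
    }
}
```

The entity class itself never declares a keyword field; the derived data lives only in the document that gets written, which matches the "synthetic property" description.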

Uwe Schäfer

May 10, 2010, 4:10:42 PM
to mor...@googlegroups.com
On 10.05.2010 18:14, Scott Hernandez wrote:

dear Scott

i just checked out your latest work on EntityListener. thanks.

> I don't see any reason you would have problems here, even now. Really,
> the FTS stuff is logically outside your entity and stored as synthetic
> properties that are derived from the actual entity data. I suspect
> that you will only use those synthetic fields when issuing queries.

right. that would be something like

@PrePersist
public DBObject PrePersistWithParamAndReturn(DBObject dbObj) {
//...
}

right? by synthetic field, you mean that attributes are added to the
returned DBObject, correct?

i´d agree with that.

problem is, how do i know the entity or its class ?
i´d need to do some reflection in order to make something like

@StemmedKeywords(indexName="title", stemmer=GERMAN,
stopWordExclusion=GERMAN)
private String germanTitle;

happen, as well as maybe the actual entity object (would be handier,
than the DBObject, i guess?).
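
The reflection step asked about here might look roughly like the following sketch. The `StemmedKeywords` annotation is a simplified, hypothetical version of the one proposed above (a `String` instead of an enum for the stemmer), and `Headline` and `AnnotationScanner` are invented for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical annotation modelled on the one proposed in this message.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface StemmedKeywords {
    String indexName();
    String stemmer() default "GERMAN";
}

// Example entity with one annotated and one plain field.
class Headline {
    @StemmedKeywords(indexName = "title")
    private String germanTitle = "Neue Wohnungen in Freiburg";

    private String internalNote = "not indexed";
}

class AnnotationScanner {

    // Returns indexName -> raw field value for every annotated String field.
    // Only fields declared on the entity's own class are inspected.
    static Map<String, String> collect(Object entity) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Field field : entity.getClass().getDeclaredFields()) {
            StemmedKeywords meta = field.getAnnotation(StemmedKeywords.class);
            if (meta == null || field.getType() != String.class) {
                continue;
            }
            field.setAccessible(true); // the fields are private
            try {
                result.put(meta.indexName(), (String) field.get(entity));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return result;
    }
}
```

Working on the entity object rather than on the `DBObject` keeps the annotation lookup simple, since the field and its value come from the same `Field` handle.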

where am i wrong?

the other thing: would there be a strong reason against making
EntityListeners generally applicable with

morphia.addEntityListener(IEntityListener el) ?

The reason i'm asking is, it is very easy to use feature annotations
like the above, but to forget registering the appropriate EntityListener.
I've seen this a hell of a lot in JPA, so that we ended up doing a prescan
& validation of entity-class configurations.
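
A pre-scan of the kind mentioned above could be sketched as follows. Both annotations are local stand-ins declared only so the example compiles without Morphia; `KeywordListener`, `ModelValidator` and the two entity classes are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local stand-ins so the sketch compiles without Morphia on the classpath.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface EntityListeners {
    Class<?>[] value();
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface StemmedKeywords {}

// Hypothetical listener that would interpret @StemmedKeywords.
class KeywordListener {}

@EntityListeners(KeywordListener.class)
class ConfiguredArticle {
    @StemmedKeywords private String title;
}

// Uses the feature annotation but forgets to register the listener.
class ForgottenArticle {
    @StemmedKeywords private String title;
}

class ModelValidator {

    // Returns the simple names of classes that use @StemmedKeywords
    // without listing requiredListener in @EntityListeners.
    static List<String> findMisconfigured(Class<?> requiredListener, Class<?>... entityClasses) {
        List<String> problems = new ArrayList<>();
        for (Class<?> type : entityClasses) {
            boolean usesFeature = false;
            for (Field field : type.getDeclaredFields()) {
                if (field.isAnnotationPresent(StemmedKeywords.class)) {
                    usesFeature = true;
                    break;
                }
            }
            EntityListeners listeners = type.getAnnotation(EntityListeners.class);
            boolean registered = listeners != null
                    && Arrays.asList(listeners.value()).contains(requiredListener);
            if (usesFeature && !registered) {
                problems.add(type.getSimpleName());
            }
        }
        return problems;
    }
}
```

Running such a check at startup turns the forgotten registration into an immediate, visible error instead of silently missing keywords.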

cu & thx, uwe

Scott Hernandez

May 10, 2010, 6:24:09 PM
to mor...@googlegroups.com
Yep, we could add an interface which allows the interception of *all*
lifecycle events.

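interface LifecycleInterceptor {

    // called before the entity is saved
    DBObject prePersist(Object entity);
    // called after the entity has been saved
    void postPersist(Object entity);

    // called before the DBObject is mapped onto the entity
    DBObject preLoad(DBObject dbObj, Object entity);
    // called after the entity has been loaded
    void postLoad(Object entity);

}

Where each method maps one to one to the lifecycle annotations in the entity class.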
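
To illustrate how a globally registered interceptor could apply to every entity without per-class annotations, here is a small self-contained sketch. It is not the Morphia API: `SimpleLifecycleInterceptor` is a reduced, hypothetical variant of the interface drafted above (a `Map` replaces `DBObject`, and `prePersist` receives the document as a parameter, which is an assumption of this sketch), and `InterceptorRegistry` and `AuditInterceptor` are invented names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reduced, hypothetical variant of the drafted interface (Map replaces DBObject).
interface SimpleLifecycleInterceptor {
    void prePersist(Object entity, Map<String, Object> dbObj);
    void postLoad(Object entity);
}

// Hypothetical registry: interceptors are registered once and apply to every entity.
class InterceptorRegistry {
    private final List<SimpleLifecycleInterceptor> interceptors = new ArrayList<>();

    void addInterceptor(SimpleLifecycleInterceptor interceptor) {
        interceptors.add(interceptor);
    }

    // Builds a document for the entity and lets each interceptor amend it,
    // in registration order.
    Map<String, Object> toDocument(Object entity) {
        Map<String, Object> dbObj = new LinkedHashMap<>();
        dbObj.put("className", entity.getClass().getName());
        for (SimpleLifecycleInterceptor interceptor : interceptors) {
            interceptor.prePersist(entity, dbObj);
        }
        return dbObj;
    }
}

// Example interceptor: marks every document, whatever the entity class.
class AuditInterceptor implements SimpleLifecycleInterceptor {
    @Override
    public void prePersist(Object entity, Map<String, Object> dbObj) {
        dbObj.put("_intercepted", Boolean.TRUE);
    }

    @Override
    public void postLoad(Object entity) {
        // nothing to do after loading in this sketch
    }
}
```

Because registration happens in one place, an entity class cannot end up using a feature annotation while missing the code that interprets it.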

2010/5/10 Uwe Schäfer <u...@thomas-daily.de>:
> right. that would be something like
>
> @PrePersist
> public DBObject PrePersistWithParamAndReturn(DBObject dbObj) {
>  //...
> }
>
> right? by synthetic field, you mean that attributes are added to the
> returned DBObject, correct?

Yes, but in fact if you return null, or declare a void return then the
passed in DBObject is used. Maybe this is too complex since you will
probably only want to change the data in the dbObj and not replace the
reference. We can clean that up.
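
The "null or void keeps the passed-in object" rule can be sketched with plain reflection. This is an illustration of the rule only, not Morphia's implementation: `CallbackInvoker` and the two entity classes are hypothetical, and a `Map` stands in for `DBObject`.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical dispatcher illustrating the "null or void keeps the original" rule.
class CallbackInvoker {

    // Invokes the named callback and returns the object that would be saved.
    @SuppressWarnings("unchecked")
    static Map<String, Object> invoke(Object entity, String methodName, Map<String, Object> dbObj) {
        try {
            Method callback = entity.getClass().getDeclaredMethod(methodName, Map.class);
            callback.setAccessible(true);
            // Method.invoke returns null for void methods, so one check covers both cases.
            Object result = callback.invoke(entity, dbObj);
            return result == null ? dbObj : (Map<String, Object>) result;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("callback failed: " + methodName, e);
        }
    }
}

// Callback that mutates the passed-in object and returns nothing.
class MutatingEntity {
    public void prePersist(Map<String, Object> dbObj) {
        dbObj.put("_keywords", "mutated");
    }
}

// Callback that returns a replacement object instead.
class ReplacingEntity {
    public Map<String, Object> prePersist(Map<String, Object> dbObj) {
        Map<String, Object> copy = new LinkedHashMap<>(dbObj);
        copy.put("_keywords", "replaced");
        return copy;
    }
}
```

For adding synthetic fields, the mutating form is usually enough, which is the simplification suggested in the paragraph above.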

> i´d agree with that.
>
> problem is, how do i know the entity or its class ?
> i´d need to do some reflection in order to make something like
>
> @StemmedKeywords(indexName="title", stemmer=GERMAN,
> stopWordExclusion=GERMAN)
> private String germanTitle;
>
> happen, as well as maybe the actual entity object (would be handier, than
> the DBObject, i guess?).

Yep, reflection is your friend. At the moment you can actually play
with the MappedClass/MappedField instances to affect this. That is
another part of the extensions that I have on the list but haven't
coded yet. It would let you easily add annotations to the registration
(mapping metadata discovery) process, where Morphia will inspect them
(and do the reflection for you), and then you can define added
behavior. Right now you can do this by adding Classes to the statics
in those Mapped* classes.

Things are close but in another month they probably would have been
ready for things like this.

In your handler you can ask the mapper instance for all the
annotations that are relevant to the class you are working with.

> where am i wrong?
>
> the other thing: would there be a strong reason against making
> EntityListeners generally applicable with
>
> morphia.addEntityListener(IEntityListener el) ?
>
> The reason i´m asking is, it is very easy to use feature annotations like
> the above, but to fail registering the appropriate EntityListener.
> I´ve seen this in JPA hell of a lot, so that we ended up doing a prescan &
> validation of entity-class configurations.

I'm not sure what you mean, can you give me an example? Is it just
that you forget to add the @EntityListeners to the Entity class?


Uwe Schäfer

May 10, 2010, 6:54:22 PM
to mor...@googlegroups.com
On 11.05.2010 00:24, Scott Hernandez wrote:

hi Scott

> Yep, we could add an interface which allows the interception of *all*
> lifecycle events.

that´d be awesome.

> It would let you easily add annotations to the registration
> (mapping metadata discovery) process where the morphia will inspect
> (and do the reflection for you) and then you can define added
> behavior. Right now you can do this by adding Classes to the statics
> in those Mapped* classes.

will have to look into the inner workings of those classes. i have not
dug into them yet. thanks for the pointer.

> I'm not sure what you mean, can you give me an example? Is it just
> that you forget to add the @EntityListeners to the Entity class?

sorry for my 'wounded' english - yes, that was exactly what i meant.
if one can get away without that (semantic) redundancy (listener on top
and annotation on field), it's harder to produce errors in the modelling
phase.

cu uwe

Scott Hernandez

May 10, 2010, 7:14:36 PM
to mor...@googlegroups.com
2010/5/10 Uwe Schäfer <u...@thomas-daily.de>:
>> It would let you easily add annotations to the registration
>> (mapping metadata discovery) process where the morphia will inspect
>> (and do the reflection for you) and then you can define added
>> behavior. Right now you can do this by adding Classes to the statics
>> in those Mapped* classes.
>
> will have to look into the inner workings of those classes. i did not dig
> into yet. thanks for pointing me.

I'll put a little sample (in the next few hours) in the tests to add a
new annotation type (just in the sample). It will help flesh things
out a bit. It will also be a good example of that extension point.

>> I'm not sure what you mean, can you give me an example? Is it just
>> that you forget to add the @EntityListeners to the Entity class?
>
> sorry for my 'wounded' english - yes, that was exactly what i meant.
> if one can get away without that (semantic) redundancy (listener on top and
> annotation on field), its harder to produce errors in the modelling phase.

Yeah, but the nice thing about the method annotations is the
orthogonality they have with the actual entity. Either way you have to
create the association (either with a line or two of code, or an
annotation). The nice thing about the annotation is that it does not
require special method names, or implementing an interface. It also
means you can take and leave different parts.


Uwe Schäfer

May 11, 2010, 5:35:31 AM
to mor...@googlegroups.com
Scott Hernandez wrote:

dear Scott,

> It will also be a good example of that extension point.

great! thank you.

> The nice thing about the annotation is that it does not
> require special method names, or implementing an interface. It also
> means you can take and leave different parts.

and you have lots of opportunities for creating errors not being dealt
with before runtime. :)

and, you have to look up how exactly the callback annotation was
supposed to be used (parameters, returns), whereas with an interface the
IDE gives you a good recipe for what to do.
and of course, there is the good old issue where you ask yourself:
if B extends A and both have a @PrePersist (or even more than one), in
which order will they be called!? that might not be obvious.

frankly, i like annotations as meta-data, but when it comes to altering
control-flow (like callbacks), i like the concept of an interface.

that might just be a matter of personal taste. :) in the end it won't
matter much, either way.

thanks for your support!

cu uwe

Scott Hernandez

May 11, 2010, 4:37:38 PM
to mor...@googlegroups.com
I just put up an example. It isn't perfect but it should do what you
want, for now.

There are a few things to note:
1.) You can create your own Annotation (or interface if you would rather)
2.) You can add the annotation to MappedClass/Field.interestingAnnotations
3.) Write some code to interpret the annotation and do something.
-- You can use the new EntityInterceptor interface (register with the Mapper)
-- You can do it in your @LifeCycle methods, or @EntityListeners classes

I also added a new lifecycle event, PreSave.

The lifecycle is like this:
1.) PrePersist (POJO -> DBObject)
2.) PreSave (before DBObject -> save())
3.) PostPersist (after DBObject -> save(); only happens to entities)
4.) PreLoad (DBObject -> POJO)
5.) PostLoad (after POJOs are created)
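
The event order listed above can be made concrete with a toy round trip. Everything here is a hypothetical simulation rather than Morphia code: `LifecycleHooks`, `RecordingHooks` and `ToyDatastore` are invented, a `Map` stands in for `DBObject`, an in-memory list stands in for the collection, and firing PrePersist before the conversion starts is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback set; names follow the events listed above.
interface LifecycleHooks {
    void prePersist(Object entity);
    void preSave(Object entity, Map<String, Object> dbObj);
    void postPersist(Object entity);
    void preLoad(Map<String, Object> dbObj);
    void postLoad(Object entity);
}

// Records each event so the call order can be inspected.
class RecordingHooks implements LifecycleHooks {
    final List<String> events = new ArrayList<>();

    @Override public void prePersist(Object entity) { events.add("PrePersist"); }
    @Override public void preSave(Object entity, Map<String, Object> dbObj) { events.add("PreSave"); }
    @Override public void postPersist(Object entity) { events.add("PostPersist"); }
    @Override public void preLoad(Map<String, Object> dbObj) { events.add("PreLoad"); }
    @Override public void postLoad(Object entity) { events.add("PostLoad"); }
}

// Toy mapper for String entities backed by an in-memory list.
class ToyDatastore {
    private final List<Map<String, Object>> stored = new ArrayList<>();
    private final LifecycleHooks hooks;

    ToyDatastore(LifecycleHooks hooks) {
        this.hooks = hooks;
    }

    void save(String entity) {
        hooks.prePersist(entity);
        Map<String, Object> dbObj = new LinkedHashMap<>();
        dbObj.put("value", entity);   // POJO -> document
        hooks.preSave(entity, dbObj); // document is complete, not yet written
        stored.add(dbObj);            // stands in for save()
        hooks.postPersist(entity);
    }

    String load(int index) {
        Map<String, Object> dbObj = stored.get(index);
        hooks.preLoad(dbObj);
        String entity = (String) dbObj.get("value"); // document -> POJO
        hooks.postLoad(entity);
        return entity;
    }
}
```

In this model, PreSave is the last point at which derived fields such as a keyword list could still be added to the document before it is written.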

2010/5/11 Uwe Schäfer <u...@thomas-daily.de>:
> and, you have to look up how exactly the callback annotation was supposed to
> be used (parameters, returns), whereas with an interface the IDE gives you a
> good recipt what to do.

A lot of IDEs have support for Annotations, but I know what you mean.

> and of course, there is the good old issue where you ask yourself:
> if B extends A and both have a @prepersist (or even more than one), in which
> order will they be called!? that might not be obvious.

It will never be clear. Unless otherwise specified it is undefined :(

> frankly, i like annotations as meta-data, but when it comes to altering
> control-flow (like callbacks), i like the concept of an interface.
>
> that might just be a matter of personal taste. :) in the end it wont matter
> much, either way.

Yep, I hear you. You make good points and I agree, in fact I started
with interfaces but the JPA stuff is annotations and we wanted to
stick close to that.