Doing a simple Google-style search with Raven

Pete

Oct 9, 2013, 10:42:45 AM
to rav...@googlegroups.com
Hi all,

I'm trying to use Raven to do a simple, useful Google-style search. It looks like it should be possible using recent Raven features, but the result isn't quite useful yet. I don't want to have to switch to ElasticSearch if I can possibly help it.

I'm not very experienced with Lucene.

The problem is that I find the default analyzer sucks in practice, because it returns nothing for 'bio' when there are plenty of 'biotopes'. If, however, I use a postfix wildcard on the last term (q + "*"), the results are OK.

The problem with this is that it seems to make the highlighting feature return no results, which defeats the purpose: a Google-style search means showing a contextual snippet around each match.

Is there a solution to either of these problems? Ideally to the first: is there a better way than the standard analyzer to get more intuitive results for a typical user?

Cheers,
Pete.

 

Khalid Abuhakmeh

Oct 9, 2013, 10:45:42 AM
to rav...@googlegroups.com
May I recommend a post I wrote about the analyzers? It might help you figure out what each of them does.

Kijana Woodard

Oct 9, 2013, 10:53:19 AM
to rav...@googlegroups.com
I just did a search on Google for "bio" to see what would happen.

Through several pages of search results, I found:
- pages containing "bio", the same as you would get through the analyzer.
- pages containing "biography", which really is a synonym for bio.
- pages containing "biology", which is another synonym for bio, at least among med students.
- URLs containing "bio", which I feel is some Google sauce that doesn't generally apply to search.

No biotopes. I'm wondering if there is an expectations issue with the users. :-]

That said, check out the analyzers and see if something else will work.
Also, is there a way to "shift the problem", meaning do some work up front that makes it easier on the analyzer? Perhaps when the doc is stored, if "bio" is detected, it can be saved in a tag field that is indexed and searched along with the regular content.


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Khalid Abuhakmeh

Oct 9, 2013, 11:09:57 AM
to rav...@googlegroups.com
Well, you might also be able to get better search results by analyzing the same field multiple times with different analyzers. The LowerCaseKeywordAnalyzer turns your whole string into one big lowercase term, while the StandardAnalyzer breaks your string up into terms. Searching through both of them might give you the results you are looking for.

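// Untested sketch: assumes an index exposing Standard and LowerCase fields.
Db.Query<Article>()
     .Search(x => x.Standard, search + "*",
             escapeQueryOptions: EscapeQueryOptions.AllowPostfixWildcard)
     .Search(x => x.LowerCase, "*" + search + "*",
             escapeQueryOptions: EscapeQueryOptions.AllowAllWildcards)
     .ToList();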

The list will be ordered by the Lucene score, if I am not mistaken.

The only thing you might have to worry about is the leading wildcard in the second clause: it forces a scan of every term in the field, which can make searches slower.
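
To make the difference between the two analyzers concrete, here is a minimal Python sketch. The two functions are simplified stand-ins written for illustration, not Lucene's actual analyzers (the real StandardAnalyzer also removes stop words, for example), and the sample title is made up.

```python
import re

def standard_like(text):
    """Rough stand-in for a word-splitting analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def lowercase_keyword_like(text):
    """Rough stand-in for a keyword analyzer: the whole value becomes one lowercase term."""
    return [text.lower()]

title = "Freshwater Biotopes of Europe"  # hypothetical document field
word_terms = standard_like(title)
whole_term = lowercase_keyword_like(title)

# A prefix query such as bio* is compared against each indexed word term.
prefix_hits = [t for t in word_terms if t.startswith("bio")]

# A *bio* query against the single keyword term amounts to a substring test.
substring_hit = "bio" in whole_term[0]
```

Under these assumptions the word-split field answers prefix queries cheaply, while the single-term field only helps when the match sits in the middle of the value, which is where the leading wildcard becomes necessary.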

Khalid Abuhakmeh

Oct 9, 2013, 11:13:28 AM
to rav...@googlegroups.com
Oh by the way, moving to ElasticSearch won't be much different than working with RavenDB since both are based on Lucene and offer the same analyzers.


Search is one of those things where we gloss over how difficult it can be to set up. It will take time to work out the kinks. Wish there was a GoogleAnalyzer :P

Pete

Oct 9, 2013, 11:53:43 AM
to rav...@googlegroups.com
Thanks for your ideas, both, but I don't think they will help. Perhaps I gave a bad example; the point is that exact full-word matches really aren't good enough for a useful search engine, and I've seen a number of naive implementers on the web, such as myself, come across this problem straight away. I already read your article, Khalid, but I don't see how any of the built-in Raven analyzers will help...

The real problem is that the results are fine using a wildcard*, but highlighting doesn't seem to work with wildcards. 

I don't know if this is a Raven bug or by design.

Pete

Kijana Woodard

Oct 9, 2013, 11:32:15 AM
to rav...@googlegroups.com
Heh. Gloss over search is right. Users think "mars landing, meh. when are we landing on an exoplanet in Andromeda already!". 

In comments: "In short, avoid “generic” search as much as possible – it’s too hard."

Also, a Google analyzer is easy to achieve.
Write all your documents out to publicly available HTML pages. Your users go to Google and type: "site:yoursite.com whatever they want to find". :-]


Kijana Woodard

Oct 9, 2013, 12:27:06 PM
to rav...@googlegroups.com
IIRC, the wildcards feature is new for 2.5 _and_ it was a PR. You might want to peek at the code and see if you can tweak it for wildcard scenarios.

Kijana Woodard

Oct 9, 2013, 12:27:36 PM
to rav...@googlegroups.com
Sorry, meant "the highlighting feature is new..."

Chris Marisic

Oct 9, 2013, 1:13:27 PM
to rav...@googlegroups.com
I'm being slightly pedantic, but there is absolutely nothing simple about Google search. The only thing that is simple is the user experience they provide.

On Wednesday, October 9, 2013 10:42:45 AM UTC-4, Pete wrote:

Kijana Woodard

Oct 9, 2013, 1:15:15 PM
to rav...@googlegroups.com
Exactly.


Matt Johnson

Oct 9, 2013, 2:25:50 PM
to rav...@googlegroups.com
You might want to consider an NGram search.  It's a little more invasive, and requires a lot more storage, but might get the effect you are looking for.

Google search for "RavenDB NGram".  You may need to update it for the latest version.

(It would be nice if this was maintained...)
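
As a rough illustration of why an n-gram index finds 'bio' inside 'biotopes', and why it costs more storage, here is a minimal Python sketch. It is not the RavenDB NGram analyzer itself, and the gram sizes (`min_n`, `max_n`) are assumptions picked for the example.

```python
def ngrams(term, min_n=2, max_n=4):
    """Return every substring of `term` whose length is between min_n and max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(term) - n + 1):
            grams.append(term[start:start + n])
    return grams

indexed = ngrams("biotopes")

# The query term is looked up as-is, with no wildcard needed.
matches = "bio" in indexed

# One word turns into many index terms, which is the storage cost.
term_count = len(indexed)
```

With these settings a single eight-letter word produces 18 index terms, so the index grows considerably, but a plain term lookup for "bio" now succeeds.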

Oren Eini (Ayende Rahien)

Oct 9, 2013, 2:29:15 PM
to ravendb
Thank you, that is absolutely correct.

Put simply, if you want to match the Google experience, be sure to have the Google budget.

Oren Eini
CEO
Hibernating Rhinos
Office:    +972-4-674-7811
Fax:       +972-153-4622-7811





Khalid Abuhakmeh

Oct 9, 2013, 4:41:22 PM
to rav...@googlegroups.com
Honestly, although the highlighting feature is cool, I would rather use something on the client side to highlight the words that match. You know the documents coming back from RavenDB have a match. Pass the search terms back up to the view and then use a client-side library to highlight them.

That's just my two cents. Highlighting just feels like a UI thing rather than a server thing. I guess highlighting might be important if you want to show a chunk of text closest to the first appearance of a search term, but you could do that after you pull the documents.

So focus on the search and get it working the way you like, and once you have all the right analyzers in place, start worrying about the UI aspects.
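
The "chunk of text closest to the first appearance" idea could look something like the following Python sketch. The function name `snippet` and the `radius` parameter are hypothetical, and it matches raw substrings only, so it would not line up with matches produced by stemming or synonyms.

```python
def snippet(text, term, radius=30):
    """Return a fragment of `text` around the first case-insensitive match of `term`, or None."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    # Clamp the window to the bounds of the text.
    start = max(0, pos - radius)
    end = min(len(text), pos + len(term) + radius)
    # Mark truncation on either side with an ellipsis.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

This requires the whole document text to be available wherever the snippet is built, which is the trade-off against letting the server return only the fragment.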

Kijana Woodard

Oct 9, 2013, 5:11:33 PM
to rav...@googlegroups.com
+1. That was my feeling all along. But I guess if you want highlighting without javascript...

Matt Warren

Oct 9, 2013, 6:06:11 PM
to rav...@googlegroups.com
Also, Google is searching the web, which gives them the advantage of hyperlinks. They use PageRank as an indicator as well as the actual text, i.e. a page is more relevant if other relevant pages link to it.

You don't have that indicator in standalone docs that aren't hyperlinked.

Pete

Oct 10, 2013, 4:37:32 AM
to rav...@googlegroups.com
Hmm, lots of straw men here! I said I'm happy with the results using q + "*"  - I'm not exactly expecting Google quality!

The highlighting feature obviously isn't practical to implement client side when you have a long piece of text from which you want to show a contextual fragment; that's why Lucene/Raven provide this feature. (And this is why I said it's a Google-style search.)

The question I'd really like to know is why highlighting doesn't work with wildcards - is it a bug, or something more fundamental?

I'll check out the NGram analyzer too; thanks Matt.

Cheers,
Pete.

Pete

Oct 10, 2013, 7:03:04 AM
to rav...@googlegroups.com
...Looking at the source, it looks like Raven uses the FastVectorHighlighter class, which seems to have supported wildcard queries for a couple of years, in Javaland at least: https://issues.apache.org/jira/browse/LUCENE-1889

I'm afraid I'm totally out of my depth in the Raven / Lucene source code!

Itamar Syn-Hershko

Oct 10, 2013, 8:46:25 AM
to rav...@googlegroups.com
That is so far from the truth

The only thing special about Google Search is the amount of data they handle and the speed at which they return results.

Everything else is a solved problem: using proper stemming algorithms like Porter or KStem (for English), the common terms query, and synonym maps (if you really want to) will get you very far.
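
As one small illustration of the synonym-map idea, here is a minimal Python sketch of query-time expansion. The map entries and the function name are hypothetical; a real search engine would normally apply synonyms inside the analysis chain rather than in application code.

```python
def expand_with_synonyms(query_terms, synonym_map):
    """Return the query terms plus any mapped synonyms, keeping order and dropping duplicates."""
    expanded = []
    for term in query_terms:
        for candidate in [term] + synonym_map.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

synonym_map = {"bio": ["biography", "biology"]}  # hypothetical entries
expanded = expand_with_synonyms(["bio", "search"], synonym_map)
```

The expanded list would then be searched as alternatives, so a query for "bio" also reaches documents that only contain "biography" or "biology".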

There is absolutely no evidence that Google Search results are the best around. We simply don't have a way to measure that. You should be happy with the search results the technology you selected provides you with, that is all. Also remember that every dataset has its own unique characteristics, and Google is currently treated as a silver bullet in the search field. Google also uses scoring algorithms which have nothing to do with text-based relevance.

Pete, it sounds like you are looking for search engine functionality. While RavenDB is built on top of Lucene and exposes much of its functionality, it is meant to be a document database and not a search engine. You really want to use Elasticsearch, which is a proper search engine supporting everything you will need from a search engine, including bleeding-edge analyzers. Also look up MatchQuery.

You can still use Elasticsearch in conjunction with an external data store - look up external versioning

Itamar Syn-Hershko

Oct 10, 2013, 8:47:39 AM
to rav...@googlegroups.com
My $0.02: don't do that. If you use synonyms, stemming, and other search optimizations, your client-side highlights will be wrong, especially best-snippet selection. Also, on the server, highlights execute very fast because of reuse of term positions etc.

Itamar Syn-Hershko

Oct 10, 2013, 8:50:12 AM
to rav...@googlegroups.com
FVH has had its fair share of bugs over the years. Wildcards should be supported because they just expand to actual terms under the hood (which is also why you shouldn't be using them extensively), but I wouldn't be surprised if the current released version has some bugs in this area.
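
The "expand to actual terms" step can be pictured with this minimal Python sketch over a sorted term list. It is a simplified model for illustration, not Lucene's implementation, and the term list is made up.

```python
import bisect

def expand_prefix(sorted_terms, prefix):
    """Return all terms in a sorted term list that start with `prefix`."""
    # Binary search to the first term that is >= the prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Terms sharing the prefix are contiguous in sorted order.
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

terms = sorted(["biography", "biology", "biotopes", "bird", "europe"])
expanded = expand_prefix(terms, "bio")
```

In this model a query for bio* becomes an OR over every expanded term, so the cost grows with the number of terms sharing the prefix, and a highlighter has to work from the expanded terms rather than the original wildcard.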


Pete

Oct 10, 2013, 9:08:28 AM
to rav...@googlegroups.com
Thanks Itamar. 

Again, I'm perfectly ok with the results I get using Raven and a wildcard, it's just that highlighting does not work with it.

I created a failing test, at least. https://gist.github.com/sumothecat/6917915

Pete.

Kijana Woodard

Oct 10, 2013, 3:38:39 PM
to rav...@googlegroups.com
I wouldn't say straw men, I would say differences in head space.

When we post on a forum, we have, hopefully, been thinking about the problem for a while. That means lots of stuff is "obvious" in our heads. The people reading the post are busily going about their lives and the only context is the post.

"from which you want to show a contextual fragment"

makes a world of difference for understanding compared to

"Google-style search"

Thanks for posting the failing test.

