Re: [RavenDB] facet term limit

Oren Eini (Ayende Rahien)

Aug 15, 2012, 7:34:11 AM
to rav...@googlegroups.com
I am not sure that I am following.
Can you explain with a real example?

On Wed, Aug 15, 2012 at 4:35 AM, Michael Weber <mtw...@gmail.com> wrote:
I am thinking about a new feature for facets.  There is a hard limit of MaxPageSize on the number of terms in a facet result.  I am looking at an application where we would like to limit the number of terms (to 2, for example).  The two that it picks should be in sorted order.  Then we would also need the number of terms that would have been returned had there been no limit, e.g. Term1 (x results), Term2 (y results), More (z more).

Looking at FacetedQueryRunner, it looks to be very easy to add a field to the Facet document for MaxResults, then use that in the ExecuteGetTermsQuery call.

I'm not 100% sure how to do the remaining terms field.
#1) Is spinning through the term list fast, even with many, many terms?
#2) There isn't a natural place to add the remaining terms count.  The only thing I could think of is extending the result to include not only Category => Results but also Category_Stats => { remaining terms }

Is this something that is useful for a pull request?

mike

Michael Weber

Aug 15, 2012, 12:05:00 PM
to rav...@googlegroups.com
Take the standard camera example.  Imagine you had a store with 10,000 SLR cameras from 100 different manufacturers.  If you search for SLR camera and want to facet by Manufacturer, then you could have 100 terms for the Manufacturer facet.  This could be slow, so instead of returning all 100 terms, return the first 10 or 20, then have a link that says something like "80 more manufacturers".  It still allows a manufacturer facet, but limits the result set on a per-facet basis.
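A minimal sketch of the behaviour described above, in plain Python rather than RavenDB code. The function name `limit_facet` and the tuple it returns are hypothetical, chosen only for illustration: keep the first `max_results` terms in sorted order and report how many terms were left out, which is the number a "More" link would show.

```python
def limit_facet(term_counts, max_results):
    """Return the first `max_results` terms (sorted by term) and the
    number of terms that were cut off.

    `term_counts` maps a facet term to its hit count. This is an
    illustrative stand-in, not the RavenDB facet result type.
    """
    ordered = sorted(term_counts.items())           # sort by term, ascending
    shown = ordered[:max_results]                   # terms that get full counts
    remaining = max(len(ordered) - max_results, 0)  # the "z more" figure
    return shown, remaining


manufacturers = {"Nikon": 3, "Canon": 5, "Sony": 1}
shown, remaining = limit_facet(manufacturers, 2)
```

With 100 manufacturers and `max_results` set to 20, `remaining` would be 80, matching the "80 more manufacturers" link in the example.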

Oren Eini (Ayende Rahien)

Aug 15, 2012, 12:15:59 PM
to rav...@googlegroups.com
Oh, okay.
Yes, that sounds good.
Pull request for that would be great.

Michael Weber

Aug 15, 2012, 12:55:57 PM
to rav...@googlegroups.com
I don't see how it's possible to implement the total terms count result without breaking compatibility by changing Dictionary<string, IEnumerable<FacetValue>> to a full object.

Is this a problem?

Oren Eini (Ayende Rahien)

Aug 15, 2012, 12:57:32 PM
to rav...@googlegroups.com
We are going to do enough breaking changes :-)

Michael Weber

Aug 15, 2012, 1:43:39 PM
to rav...@googlegroups.com
After starting to look into this, it doesn't seem to be possible to be accurate on the total terms statistic when there is sharding involved.  LazyFacetsOperation.HandleResponses seems to handle the sharding case by adding up all of the counts from the various servers involved in sharding.  This is fine for the ranges, but we cannot simply add up the TotalTerms field from each server since two servers may have had the same term.

It seems that the only way to produce an accurate term total with sharding would be to return all of the terms in the facet result (but not execute the lucene facet count query).  This would allow the sharding response processor to determine which terms are duplicated.  But this seems like a lot of data to be sending down the wire.
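To illustrate the sharding problem with a sketch (plain Python, hypothetical function names, not the LazyFacetsOperation code): summing per-shard term totals overcounts any term that appears on more than one shard, while merging the term lists themselves gives the exact distinct count.

```python
def total_terms_naive(shard_terms):
    """Add up each shard's own term count. Overcounts shared terms."""
    return sum(len(set(terms)) for terms in shard_terms)


def total_terms_merged(shard_terms):
    """Union the term lists from every shard, then count distinct terms."""
    distinct = set()
    for terms in shard_terms:
        distinct.update(terms)
    return len(distinct)


# Two shards that both hold documents for "Canon".
shards = [["Canon", "Nikon", "Sony"], ["Canon", "Pentax"]]
naive = total_terms_naive(shards)    # counts "Canon" twice
merged = total_terms_merged(shards)  # counts "Canon" once
```

The cost is the one noted above: every shard has to send its full term list, not just a number.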

Oren Eini (Ayende Rahien)

Aug 15, 2012, 1:53:34 PM
to rav...@googlegroups.com
Michael,
In most cases, the total number of terms you have is low, dozens to hundreds, tops.
Seems reasonable to me.

Michael Weber

Aug 15, 2012, 3:04:37 PM
to rav...@googlegroups.com
Cool -- that actually might be useful too, since you could make the "More" link actually produce the list of other manufacturers; there just wouldn't be counts associated with them.

Michael Weber

Aug 16, 2012, 12:48:53 AM
to rav...@googlegroups.com
After spending the day on this, I think I've come up with a way to do virtually limitless terms for facets.  Instead of issuing a query per facet term and searching through all of the facet terms for the index regardless of whether they are in the baseQuery or not, I just issue the baseQuery with a custom Lucene Collector.  Doing this gives us direct access to the terms (and counts) associated with that query.

Using the freedb database (3.1 million records) and an index on genre and year, I can perform a full counting facet on both the genre (48,000 terms) and year (1200 terms) in 1.2 seconds after warmup.

I have also added support for limiting the number of returned FacetValues with a MaxResults field on the Facet document, and I return all terms that match the query as well.  There is also a sort mode added that allows you to decide which "MaxResults" facet values to return (sort by term ASC, term DESC, number of hits ASC, number of hits DESC).
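A rough sketch of the idea in plain Python, not the actual Lucene Collector or RavenDB code. It assumes the hits of the base query are available as a list of dicts. The sort-mode names and the functions `collect_facet` and `limit_values` are hypothetical. Counting happens while walking the hits once, then the sort mode decides which `max_results` values are returned alongside the full term list.

```python
from collections import Counter

# Hypothetical names for the four orderings listed above:
# each maps to (sort key, reverse flag).
SORT_KEYS = {
    "term_asc": (lambda item: item[0], False),
    "term_desc": (lambda item: item[0], True),
    "hits_asc": (lambda item: item[1], False),
    "hits_desc": (lambda item: item[1], True),
}


def collect_facet(matching_docs, field):
    """Count each value of `field` in a single walk over the query hits."""
    counts = Counter()
    for doc in matching_docs:
        if field in doc:
            counts[doc[field]] += 1
    return counts


def limit_values(counts, max_results, sort_mode):
    """Return the top `max_results` (term, hits) pairs plus every term."""
    key, reverse = SORT_KEYS[sort_mode]
    ordered = sorted(counts.items(), key=key, reverse=reverse)
    return ordered[:max_results], sorted(counts)


hits = [{"genre": "rock"}, {"genre": "jazz"}, {"genre": "rock"},
        {"genre": "folk"}, {"genre": "rock"}, {"genre": "jazz"}]
counts = collect_facet(hits, "genre")
values, all_terms = limit_values(counts, 2, "hits_desc")
```

Only terms that occur in the hits are ever touched, which is why the cost no longer depends on how many terms the index holds overall.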


I still have some more checking to do before submitting a pull request, but this commit is the bulk of it.

mike

Oren Eini (Ayende Rahien)

Aug 16, 2012, 4:27:59 AM
to rav...@googlegroups.com
Michael,
That sounds awesome!
Waiting eagerly for the pull request.
In addition to that, can you send us the CLA?

Michael Weber

Aug 16, 2012, 4:57:14 AM
to rav...@googlegroups.com
Sent the pull request.  I'm pretty sure everything is in order, and tests are included to verify the new behavior.  Sure, will send the CLA in the morning.

mike

Michael Weber

Aug 16, 2012, 5:07:12 AM
to rav...@googlegroups.com
I also found one more performance enhancement: if you have multiple facet queries (like Genre and Year), we only run the Lucene query once and count the terms for each of the queries in one pass through the Collector.

For a practical example from the freedb database: querying for "a* OR b*" in artist and album title takes 1,400ms and returns 600k results.  Adding a facet to the query over Genre and Year takes ~1,650ms and returns 16,000 genre term groups and 500 year term groups.
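As a sketch of the one-pass idea (plain Python with a hypothetical `collect_facets` function, not the real Collector): each hit is visited once, and every requested facet field is counted during that visit, so adding a second facet does not add a second query.

```python
from collections import Counter


def collect_facets(matching_docs, fields):
    """Count the values of every field in `fields` in one pass over the hits."""
    counts = {field: Counter() for field in fields}
    for doc in matching_docs:          # single pass over the query results
        for field in fields:           # update every facet for this hit
            if field in doc:
                counts[field][doc[field]] += 1
    return counts


hits = [
    {"genre": "rock", "year": 1991},
    {"genre": "jazz", "year": 1959},
    {"genre": "rock", "year": 1959},
]
facets = collect_facets(hits, ["genre", "year"])
```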

Oren Eini (Ayende Rahien)

Aug 16, 2012, 9:35:52 AM
to rav...@googlegroups.com
Looking forward to going over the code once we have the CLA

Matt Warren

Aug 17, 2012, 4:33:58 PM
to rav...@googlegroups.com
This is really nice, I can't believe I never thought of using a collector to implement facets. It's a really nice solution and much more efficient than the old method.

Oren Eini (Ayende Rahien)

Aug 20, 2012, 12:54:27 PM
to rav...@googlegroups.com
I agree, I just went over the code and it is really nice.
I took your idea a bit further and made sure that we only do a single pass on the index, regardless of how many facets & terms we have, even if we have ranges, too.

Michael Weber

Aug 20, 2012, 2:54:34 PM
to rav...@googlegroups.com
Very nice, I didn't think of that.  It looks like you used it to solve the exclusive/inclusive TODO in the facet runner also.
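A sketch of how range facets can share the same single pass, with explicit inclusive/exclusive bounds. This is plain Python with a hypothetical `Range` class and `count_ranges` function. The actual facet runner code is not shown in this thread, so none of this should be read as its implementation.

```python
from dataclasses import dataclass


@dataclass
class Range:
    """A facet range with independently inclusive or exclusive ends."""
    label: str
    low: float
    high: float
    include_low: bool = True
    include_high: bool = False

    def contains(self, value):
        above = value >= self.low if self.include_low else value > self.low
        below = value <= self.high if self.include_high else value < self.high
        return above and below


def count_ranges(matching_docs, field, ranges):
    """Count hits per range while walking the query results once."""
    counts = {r.label: 0 for r in ranges}
    for doc in matching_docs:
        if field not in doc:
            continue
        for r in ranges:
            if r.contains(doc[field]):
                counts[r.label] += 1
    return counts


hits = [{"year": 1959}, {"year": 1980}, {"year": 1991}, {"year": 2000}]
ranges = [
    Range("before 1980", 0, 1980),                        # 1980 excluded
    Range("1980 to 2000", 1980, 2000, include_high=True)  # both ends included
]
result = count_ranges(hits, "year", ranges)
```

Because each bound carries its own flag, a boundary value such as 1980 lands in exactly one of two adjacent ranges.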

Oren Eini (Ayende Rahien)

Aug 20, 2012, 6:03:43 PM
to rav...@googlegroups.com
Yeah, that was along the way

ZNS

Oct 4, 2012, 3:44:39 PM
to rav...@googlegroups.com

I'm working a bit with facets now and this seems cool. I was just wondering if this was ever implemented, since the pull request is still active, maybe it's just lack of experience of git ;)

ZNS

Oct 4, 2012, 3:46:25 PM
to rav...@googlegroups.com

Got a bad typo there, it's meant to say "just MY lack of experience with git..".

Oren Eini (Ayende Rahien)

Oct 4, 2012, 3:52:46 PM
to rav...@googlegroups.com
It has been pulled, yes. To the 1.2 branch, though.