Sorting without using Facets


Mark Williams

Jul 13, 2013, 8:17:22 PM
to sensei...@googlegroups.com
In our index we have a field that stores each document's modification timestamp with millisecond granularity. We need to be able to sort results by this field. Sensei requires a field to be defined as a facet before it can be sorted on (a simple facet will suffice). However, given the field's high cardinality, which would approximate the number of documents in the index, we can't do this without quickly running out of memory.

Is there any other way to perform sorting without using facets? Or would we need to modify the Collector in order to use the traditional method of sorting using field cache?

Thanks,

--Mark

Mark Williams

Jul 13, 2013, 9:33:57 PM
to sensei...@googlegroups.com
A little more context. If I don't specify the sort field as a facet, then ultimately I get this error:

2013-07-13 18:25:43,805 [norbert-message-executor-thread-8] INFO com.browseengine.bobo.sort.SortCollector - doing default lucene sort for: <custom:"modificationDate_sort": null>!
2013-07-13 18:25:43,805 [norbert-message-executor-thread-8] ERROR com.senseidb.svc.impl.CoreSenseiServiceImpl - lucene custom sort no longer supported: modificationDate_sort
java.lang.IllegalArgumentException: lucene custom sort no longer supported: modificationDate_sort
at com.browseengine.bobo.sort.SortCollector.getNonFacetComparatorSource(SortCollector.java:178)
at com.browseengine.bobo.sort.SortCollector.getComparatorSource(SortCollector.java:216)
at com.browseengine.bobo.sort.SortCollector.buildSortCollector(SortCollector.java:262)
at com.browseengine.bobo.api.MultiBoboBrowser.getSortCollector(MultiBoboBrowser.java:358)

I'm not sure how to specify a sort that is not custom but is instead one of the other types (SortField.STRING, SortField.INT, etc.). I've tried defining it in the table definition like this, but to no avail:

<column name="modificationDate_sort" type="long"/> 
or 
<column name="modificationDate_sort" type="string"/>

John Wang

Aug 8, 2013, 4:28:54 PM
to sensei...@googlegroups.com
Hi Mark:

I think that functionality was removed in one of the releases. If you don't specify the field as a facet, Lucene will try to load a field cache into the heap, and that cache is managed without Sensei's knowledge. In a highly realtime environment, that causes problems.

-John



Mark Williams

Aug 8, 2013, 6:10:00 PM
to sensei...@googlegroups.com
Thanks for the response John.

Are you saying that using the FieldCache in any way is a problem when running Sensei under realtime load? Do you mean it should be avoided for everything, including filters such as FieldCacheTermsFilter?

How do you avoid the same problem with facets? Don't you need to reload the forward indexes every time the reader is reopened?

I will go back and look at this again. We have another high-cardinality field with a simple facet, and although it takes a large portion of the heap, it's not as bad as we thought it would be, so I may reconsider using a simple facet for sorting. I'm still unsure how to calculate memory requirements for facets. I was using this formula:

Total bytes = 4 * cardinality * numberOfDocs

Since the values we are sorting on are unique this would be:

Total bytes = 4 * numberOfDocs * numberOfDocs = 4 * numberOfDocs^2

For some reason we are getting nowhere close to this.

--Mark

John Wang

Aug 9, 2013, 1:01:23 AM
to sensei...@googlegroups.com
As of version 3.5, the field cache is loaded lazily at query time. This works when you have a static index and can "warm up" the searches by issuing queries. It does not work well with realtime indexing, since the underlying index is changing constantly.

Defining a facet on this field essentially creates a field cache for sorting, except that you are letting Sensei control how and when it is loaded.
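To illustrate the idea in general terms, here is a minimal sketch of sorting hits against a per-document array of primitive values, which is roughly what a field cache or forward index provides. It is not Sensei's or Bobo's actual implementation; the class `ForwardIndexSort`, the method `topNewest`, and the sample timestamps are hypothetical names and data chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: sorting hits by a per-document array of primitive values. */
class ForwardIndexSort {

    /**
     * Returns the ids of the n matching documents with the largest values, largest first.
     * values[docId] holds the sort value, for example a modification timestamp.
     */
    static List<Integer> topNewest(long[] values, int[] matchingDocs, int n) {
        // Min-heap on value: the head is the weakest hit currently kept.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Long.compare(values[a], values[b]));
        for (int doc : matchingDocs) {
            heap.offer(doc);
            if (heap.size() > n) {
                heap.poll(); // drop the oldest of the kept hits
            }
        }
        List<Integer> result = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) {
            result.add(heap.poll()); // drained in ascending order of value
        }
        Collections.reverse(result); // flip to descending: newest first
        return result;
    }

    public static void main(String[] args) {
        // One long per document, indexed by document id (assumed sample data).
        long[] modified = {1373761042000L, 1373765637000L, 1375993734000L, 1375999800000L};
        int[] hits = {0, 1, 3}; // documents matched by some query
        System.out.println(topNewest(modified, hits, 2)); // prints [3, 1]
    }
}
```

The array costs one primitive per document regardless of how many distinct values the field has, which is why the question in a realtime setting is who loads and refreshes it rather than how large it is.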

As for the expected memory size, it is maxDocs * sizeof(your primitive), so for integer-type values and 10M docs, it would be 40MB.
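The arithmetic can be checked with a small sketch that compares this linear estimate with the quadratic formula from the earlier message. The class and method names are hypothetical, and the 8-byte case assumes the millisecond timestamp is held as a `long`, which the thread does not confirm.

```java
/** Hypothetical helper comparing the two memory estimates discussed in this thread. */
class FacetMemoryEstimate {

    /** Linear estimate: one primitive value per document. */
    static long linearBytes(long maxDocs, int bytesPerValue) {
        return Math.multiplyExact(maxDocs, (long) bytesPerValue);
    }

    /** Earlier guess from the thread: 4 * cardinality * numberOfDocs. */
    static long quadraticBytes(long cardinality, long numberOfDocs) {
        return Math.multiplyExact(Math.multiplyExact(4L, cardinality), numberOfDocs);
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        // 4-byte int values: 40,000,000 bytes, about 40 MB.
        System.out.println(linearBytes(docs, Integer.BYTES));
        // 8-byte long values (assumed for millisecond timestamps): 80,000,000 bytes.
        System.out.println(linearBytes(docs, Long.BYTES));
        // Quadratic guess with unique values: 4 * 10^14 bytes, far beyond any heap.
        System.out.println(quadraticBytes(docs, docs));
    }
}
```

The gap between the two results is consistent with the observation that real heap usage came nowhere close to the quadratic figure.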

-John