top-K results in Lucene

Shoaib Akram

Jun 3, 2024, 9:03:37 AM
to dotCMS User Group
Hi All,

I have a question about Lucene's internal behavior when a search is restricted to the top-K documents. For example, if we call searcher.search(query, K), does Lucene read all the matching docIDs from the disk-resident index, score every one of them, and then return the top K documents? Or does it have some mechanism to retrieve fewer documents than in the case where K is very large or unlimited (meaning return all matching results)?

I am thinking from the I/O perspective. Does limiting K to, say, 20 instead of 2000 reduce the I/O from disk? If Lucene does have a way to limit disk I/O when K is small, how does it achieve that? I know there are Impacts associated with each document, but is there some other way in which Lucene can do fewer reads from disk to find the top K documents?
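To make the baseline in this question concrete, here is a minimal sketch of a naive top-K scan with a bounded min-heap. It is a toy model, not Lucene's code: `TopKSketch`, `Hit`, `topK`, and the `examined` counter are hypothetical names, and the array of precomputed scores stands in for whatever a real scorer would read from the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class TopKSketch {
    /** Hypothetical stand-in for a (docID, score) pair. */
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    /** Number of candidates examined by the last topK call. */
    static int examined = 0;

    /** Naive top-K: every candidate is visited; only the heap is bounded by k. */
    static List<Hit> topK(float[] scoresByDocId, int k) {
        examined = 0;
        // Min-heap on score: the weakest of the current top K sits at the head.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (int docId = 0; docId < scoresByDocId.length; docId++) {
            examined++;
            float score = scoresByDocId[docId];
            if (heap.size() < k) {
                heap.add(new Hit(docId, score));
            } else if (k > 0 && score > heap.peek().score) {
                // Candidate beats the weakest kept hit, so it replaces it.
                heap.poll();
                heap.add(new Hit(docId, score));
            }
        }
        // Return the kept hits ordered from best to worst.
        List<Hit> result = new ArrayList<>(heap);
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.4f, 0.7f, 0.2f};
        for (Hit h : topK(scores, 2)) {
            System.out.println(h.docId + " " + h.score);
        }
        System.out.println("examined: " + examined);
    }
}
```

In this naive form, K only bounds the size of the heap; the number of candidates examined is the same for K = 20 and K = 2000. Any I/O saving would therefore have to come from skipping candidates that cannot enter the heap, which is the part being asked about.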

Thanks for any insights.

Shoaib

Shoaib Akram

Jun 3, 2024, 9:33:27 AM
to dot...@googlegroups.com

Apologies, this was the wrong group to post this query 😊

 

--
http://www.dotcms.com - Open Source headless/hybrid CMS
---
You received this message because you are subscribed to a topic in the Google Groups "dotCMS User Group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dotcms/kqdHGitOI9w/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dotcms+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dotcms/1b672f9b-4b17-41fe-ba4d-5406cbcc3706n%40googlegroups.com.

Will Ezell

Jun 3, 2024, 9:35:20 AM
to dot...@googlegroups.com
We don't store the whole document when adding content to the index; we only store a few fields, such as the identifier and the inode. This means that when querying, Elasticsearch is only reading its indexes and not the documents themselves.

These are queried fully, and the Elasticsearch results are returned as just an array of inodes (Strings) in the order specified. dotCMS then takes these and hydrates them into the correct content objects (from the memory cache, then the disk cache, then from the db on a cache miss).
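The lookup order described above can be sketched roughly as follows. This is only an illustration and not dotCMS code: `HydrationSketch`, `hydrate`, `hydrateAll`, and the three maps are hypothetical names, every tier is modeled as a plain in-memory map, and promoting a loaded entry into the faster tiers is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HydrationSketch {
    // Hypothetical tiers, all modeled as in-memory maps (inode -> content).
    final Map<String, String> memoryCache = new HashMap<>();
    final Map<String, String> diskCache = new HashMap<>();
    final Map<String, String> database = new HashMap<>();

    /** Counts how often the slowest tier had to be consulted. */
    int dbLoads = 0;

    /** Resolve one inode: memory cache, then disk cache, then the db. */
    String hydrate(String inode) {
        String content = memoryCache.get(inode);
        if (content != null) {
            return content;
        }
        content = diskCache.get(inode);
        if (content == null) {
            dbLoads++;
            content = database.get(inode);
            if (content != null) {
                diskCache.put(inode, content); // assumption: promote on load
            }
        }
        if (content != null) {
            memoryCache.put(inode, content); // assumption: promote on load
        }
        return content;
    }

    /** Hydrate a result list, preserving the order returned by the search. */
    List<String> hydrateAll(List<String> inodes) {
        List<String> contents = new ArrayList<>();
        for (String inode : inodes) {
            contents.add(hydrate(inode));
        }
        return contents;
    }

    public static void main(String[] args) {
        HydrationSketch sketch = new HydrationSketch();
        sketch.database.put("inode-a", "content A");
        sketch.database.put("inode-b", "content B");
        System.out.println(sketch.hydrateAll(Arrays.asList("inode-b", "inode-a")));
        System.out.println("db loads: " + sketch.dbLoads);
    }
}
```

In this model, a second call with the same inodes is served entirely from the memory cache, which is the "hot" cache case.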

So limiting the number of results will not really affect your disk I/O or your performance. What will affect query performance is whether your caches are properly sized and already "hot", so that the whole contentlet does not need to be reloaded from the db.

--



382 NE 191st St #92150
Miami, Florida 33179-3899
Main: 305-900-2001 | Direct: 978.294.9429
