Re: [Whoosh] Duplicates in results


Matt Chaput

Aug 9, 2012, 1:39:20 PM
to who...@googlegroups.com
On 09/08/2012 1:29 PM, Michael Foord wrote:
> I'm trying to eliminate duplicates from the results of queries.
> Obviously eliminating them afterwards is easy enough, but I'd like to be
> able to report the total number of actual results to the user as well.

Just so we're clear, you have duplicates in the index, and you're using
collapsing to eliminate them from the results? Is there some reason you
can't just remove/not index the duplicates?

Thanks,

Matt

Michael Foord

Aug 9, 2012, 8:50:29 PM
to who...@googlegroups.com
OK, so with a bit of experimentation I found that I *do* have duplicates in the index. However, I shouldn't. I guess my actual question is: why have I ended up with duplicates? (Note in particular that "id" is specified in the Whoosh schema as unique, yet if I search for documents by "id" I actually get several results.)
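
For the record, the check was roughly this (a sketch; "some_id" stands in for the id of one of the affected items):

from whoosh import index

ix = index.open_dir(settings.WHOOSH_INDEX)
with ix.searcher() as searcher:
    # searcher.documents() yields the stored fields of every document
    # matching the keyword arguments; more than one hit for a unique
    # "id" means the index really does contain duplicates
    hits = list(searcher.documents(id=some_id))
    print(len(hits))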

This is the full code below, hooked up with Django signals.

from django.conf import settings
from django.db.models import signals
from django.utils.html import strip_tags

from whoosh import fields, index
from whoosh.writing import AsyncWriter

from myapp.models import Item  # placeholder: wherever the Item model lives

WHOOSH_SCHEMA = fields.Schema(
    title=fields.TEXT,
    description=fields.TEXT,
    summary=fields.TEXT,

    # Stored to present in results
    id=fields.NUMERIC(stored=True, unique=True),
)



def update_index(sender, instance, created, **kwargs):
    ix = index.open_dir(settings.WHOOSH_INDEX)

    writer = AsyncWriter(ix)
    try:
        if created and not instance.active:
            # brand-new but inactive: nothing to index
            return
        if not instance.active:
            if instance.was_active:
                # deactivated: remove it from the index
                writer.delete_by_term('id', instance.id)
            return

        # default to updating; switch to adding for brand-new items
        # and for inactive items that have just become active
        method = writer.update_document
        if created or (instance.active and not instance.was_active):
            method = writer.add_document

        # Note that title, summary and description are Unicode already
        method(
            title=instance.title,
            summary=instance.summary,
            description=strip_tags(instance.description),
            id=instance.id,
        )
    finally:
        writer.commit(optimize=True)

signals.post_save.connect(update_index, sender=Item)

An edit to an existing item should call update_document (unless a previously inactive, and therefore unindexed, item is becoming active). A newly created item, or an inactive item becoming active, should call add_document. Any ideas?

All the best,

Michael Foord

 

Michael Foord

Aug 9, 2012, 8:52:18 PM
to who...@googlegroups.com


On Thursday, 9 August 2012 18:39:20 UTC+1, Matt Chaput wrote:
On 09/08/2012 1:29 PM, Michael Foord wrote:
> I'm trying to eliminate duplicates from the results of queries.
> Obviously eliminating them afterwards is easy enough, but I'd like to be
> able to report the total number of actual results to the user as well.

Just so we're clear, you have duplicates in the index, and you're using
collapsing to eliminate them from the results?


That was my goal. I initially thought that if a search matched several fields, a document might be returned multiple times, which is why I was trying to uniquify the results. It turns out I really do have duplicates in the index. My main question is why I have duplicates (see the other post). As a side question, though: should my collapsing have worked, and if not, what should I have done instead?
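
The collapsed search looked something like this (a sketch; "my_query" stands in for whatever the query parser produced):

with ix.searcher() as searcher:
    # collapse="id" keeps only the best-scoring hit per distinct id
    results = searcher.search(my_query, collapse="id")
    print(len(results))  # this is the count that looked wrong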

All the best,

Michael 

Matt Chaput

Aug 15, 2012, 5:47:11 PM
to who...@googlegroups.com
> That was my goal. I initially thought maybe if a search matched
> several fields a document could be returned multiple times, which is
> why I was trying to uniquify the results. It turns out I do have
> duplicates in the index. My main question is why do I have
> duplicates

Hi, sorry for the late reply. Work :(

I'm afraid I don't know why you have duplicates. Using a numeric field
as the unique key does work (or at least, I have a
unit test for it that's passing :/ ).

You might need to try reducing your indexing pipeline to a reproducible
test case.
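
Something like this would be a starting point (a sketch using a temporary directory; it indexes the same unique id twice and then checks the document count):

import tempfile

from whoosh import fields, index

schema = fields.Schema(id=fields.NUMERIC(stored=True, unique=True),
                       title=fields.TEXT)
ix = index.create_in(tempfile.mkdtemp(), schema)

for title in (u"first", u"second"):
    writer = ix.writer()
    writer.update_document(id=1, title=title)
    writer.commit()

with ix.searcher() as searcher:
    # should print 1 if the unique field is deduplicating correctly
    print(searcher.doc_count())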

I should point out you're doing extra work by first deleting any
existing documents, and then using update_document(). update_document()
just deletes any documents matching the unique fields and then calls
add_document().
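
So the handler could probably be reduced to something like this (a sketch, keeping your AsyncWriter and field names):

def update_index(sender, instance, created, **kwargs):
    ix = index.open_dir(settings.WHOOSH_INDEX)
    writer = AsyncWriter(ix)
    try:
        if not instance.active:
            # deleting is harmless even if the item was never indexed
            writer.delete_by_term('id', instance.id)
            return
        # update_document() deletes any document with this unique id
        # and then re-adds it, so it covers both create and edit
        writer.update_document(
            title=instance.title,
            summary=instance.summary,
            description=strip_tags(instance.description),
            id=instance.id,
        )
    finally:
        writer.commit()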

> As a side question though, should my collapsing have worked - and if
> not what should I have done instead?

I think there's a bug in the code for getting the number of results when
the results are collapsed and the number of results to return is
limited (with the limit= keyword). I tried to reproduce your problem and
found a bug, but it might be a different bug :/ Can you try changing
your code to use limit=None and see if it fixes the number of hits reported?
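
I.e. something like this (assuming your existing query and searcher):

results = searcher.search(my_query, collapse="id", limit=None)
print(len(results))  # the collapsed count, with no limit applied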

Thanks!

Matt
