Duplicate document ID warning and incorrect record count/missing records

Alex Kahn

Nov 1, 2011, 6:18:39 PM
to Thinking Sphinx
Hi,

I'm adding a new index to my application. It looks like this:

class Account < ActiveRecord::Base
  define_index do
    indexes account_name.first_name
    indexes account_name.last_name
    indexes email_addresses.email_address

    has created_at

    set_property :delta => :datetime, :threshold => 2.minutes
  end
end

I'm testing how long the full index takes to generate on a staging
server where we typically have only sanitized data from production.
But for this task, I'm working with our entire accounts,
account_names, and email_addresses tables from production.

When I generate the index, I get the following warning during the
accounts index phase:

WARNING: duplicate document ids found

In the Rails console, I observe the following:
>> Account.search.total_entries
=> 260793
>> Account.count
=> 602083
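
To put the two console figures above side by side, here is a minimal plain-Ruby sketch. The variable names are illustrative only; the numbers are the ones reported in this post.

```ruby
# Figures copied from the console session above.
indexed = 260_793   # Account.search.total_entries
in_db   = 602_083   # Account.count

# Records present in the database but not counted in the index.
missing = in_db - indexed

# Share of database records that the index reports, as a percentage.
coverage = (indexed * 100.0 / in_db).round(1)

puts "missing from index: #{missing}"
puts "indexed share: #{coverage}%"
```

So a little under half of the accounts show up in the search count.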

Locally, with a much smaller subset of the data, I also get a
different count from each data source, but I don't receive the
"duplicate document ids" warning when generating the index.

My research so far has indicated that this is an issue with merging
indexes. But here I'm generating a full index, not generating a
delta index and then merging it into a full index.

My questions are:

1. The warning and the discrepancy in count, are they related?
2. What does the warning mean?
3. Is all of my data accessible via searching, despite the different
counts?
4. How can I fix this?

Thanks in advance for any assistance,
Alex Kahn

P.S. I'm using Rails 2.3.14, Sphinx 0.9.9, thinking-sphinx 1.4.7, and
ts-datetime-delta 1.0.2.

Pat Allan

Nov 2, 2011, 9:32:19 AM
to thinkin...@googlegroups.com
Hi Alex

How many other Sphinx indices do you have in your app? Just wondering if there's some conflict somehow, though that surely would crop up in dev as well.

As for the missing records - do you have many accounts with no first name, last name or email addresses? I remember reading somewhere that Sphinx ignores records that have no data in their fields. Not sure if these two problems are related to each other.

Also, going by an issue you logged on Github - is this the app you're using the indexed_models setting with? Can you confirm that all relevant models are in that setting?

Cheers

--
Pat

Alex Kahn

Nov 2, 2011, 4:03:55 PM
to Thinking Sphinx
Hi Pat,
Thanks for your response. We have 7 other models indexed by Sphinx.
They are all much smaller tables, the largest containing around 5,000
records. Their indices are far more complex, though.
There is no missing data for email addresses. However, there are many
account_names records that contain NULL or "" values for the
first_name and/or last_name columns. Indeed, removing the two
account_names lines from the define_index block and re-running the
index task removes the duplicate document ID warning. However, due to
an unrelated configuration issue, I'm not able to get the
total_entries count from a console. Using the `search` command line
tool, it seems that the account_core index now has a total of 602083
documents (all of them!).
So it looks like the blank data is the cause for the duplicate
document id warning and the seemingly-missing records. What would you
suggest as a way to work around this issue? I can try casting the NULL
values to empty strings, or adding other data to the index that would
help Sphinx distinguish between records (but wouldn't the timestamp
and email address fields do that?). Anything you'd suggest?
And yes, the Account model is listed in the indexed_models setting.
Cheers, Alex
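
As a rough illustration of the blank-name situation described above, the following plain-Ruby sketch counts rows whose name columns are NULL or empty. The sample rows and the blank_value? helper are hypothetical, not taken from the application. It also shows that casting NULL to an empty string yields a value with no text in it, which may be why that cast alone would not give the indexer anything new to work with.

```ruby
# Hypothetical sample rows standing in for account_names records;
# none of this data comes from the thread.
rows = [
  { :first_name => "Ada", :last_name => "Lovelace" },
  { :first_name => nil,   :last_name => "" },
  { :first_name => "",    :last_name => "Hopper" },
  { :first_name => nil,   :last_name => nil }
]

# Treat NULL (nil), "" and whitespace-only strings as blank.
def blank_value?(value)
  value.to_s.strip.empty?
end

# Rows where both name columns are blank, and rows where exactly one is.
both_blank = rows.count { |r| blank_value?(r[:first_name]) && blank_value?(r[:last_name]) }
one_blank  = rows.count { |r| blank_value?(r[:first_name]) ^ blank_value?(r[:last_name]) }

# Casting NULL to an empty string, as proposed above, adds no text.
normalized = rows.map { |r| [r[:first_name].to_s, r[:last_name].to_s] }

puts "both name columns blank: #{both_blank}"
puts "exactly one name column blank: #{one_blank}"
```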

Alex Kahn

Nov 2, 2011, 4:13:48 PM
to Thinking Sphinx
Sorry. Looks like Google Groups ate my newlines. :(

Pat Allan

Nov 3, 2011, 3:32:32 AM
to thinkin...@googlegroups.com
Hi Alex

A quick work-around could be to add a custom field to the mix, to ensure there's always data:

indexes "'account'", :as => :placeholder_field

Does that help?

--
Pat
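
For context, the placeholder line suggested above would sit alongside the existing fields in the define_index block from the first message. This is a sketch that assumes the rest of the index stays as originally posted. It is an index configuration fragment that needs the Rails app with thinking-sphinx loaded, not a standalone script, and the thread does not confirm whether it resolved the warning.

```ruby
class Account < ActiveRecord::Base
  define_index do
    indexes account_name.first_name
    indexes account_name.last_name
    indexes email_addresses.email_address

    # Constant SQL string, so every document has some field content
    # even when the name columns are NULL or empty.
    indexes "'account'", :as => :placeholder_field

    has created_at

    set_property :delta => :datetime, :threshold => 2.minutes
  end
end
```

Because this changes the structure of the index, the Sphinx configuration has to be regenerated and the data reindexed (in Thinking Sphinx, `rake ts:rebuild`) before the counts can be compared again.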
