null values in solr indexes


kevin

Jul 14, 2009, 10:19:17 AM
to django-haystack
Hi,

I've encountered a problem indexing data via Solr. I've got a fairly
complex schema with a lot of integer and date fields, and many records
contain null values in those fields. Indexing fails on those nulls.

My understanding is that Solr handles nulls by leaving the fields out
of the document record. However, from looking through the code it
appears that haystack always includes any fields that are found in the
index schema. As a result my Solr indexer failed when it encountered a
null date record, as there was an invalid date value being returned in
the document. For dates this appears to be a particularly sticky
problem but it applies to the other fields as well (e.g. for an
integer 0 != null).

So I took a stab at solving the problem.

My solution involved modifying the prepare function in SearchIndex to
check the value of the field before including it in the prepared data
dict. The idea is that if a model field or a field's prepare function
returns None, the field is left out of the data included in the
document.

def prepare(self, obj):
    """
    Fetches and adds/alters data before indexing.
    """
    self.prepared_data = {}

    for field_name, field in self.fields.items():
        value = field.prepare(obj)
        # Skip null values so the field is left out of the document.
        if value is not None:
            self.prepared_data[field_name] = value

    for field_name, field in self.fields.items():
        if hasattr(self, "prepare_%s" % field_name):
            value = getattr(self, "prepare_%s" % field_name)(obj)
            if value is not None:
                self.prepared_data[field_name] = value
            else:
                # A None from prepare_<field> also removes the field.
                self.prepared_data.pop(field_name, None)

    return self.prepared_data


It seemed like a logical place to make the change; however, I don't
know how this will impact other backends. An alternative solution would
be to do a backend-specific prepare processor (perhaps as a
post-processing step to the general prepare).
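
A minimal sketch of that post-processing alternative, assuming a
hypothetical helper named `strip_nulls` that a Solr-specific backend
would call on the dict returned by the general prepare. The helper name
and the sample field names are illustrative only and are not part of
Haystack:

```python
def strip_nulls(prepared_data):
    """Return a copy of prepared_data without keys whose value is None.

    Hypothetical helper, not part of Haystack: a Solr backend could call
    it just before building the document so null fields are omitted.
    """
    return {
        name: value
        for name, value in prepared_data.items()
        if value is not None
    }


# Illustrative field names. 0 is kept because it is a real value, not a null.
document = strip_nulls({"title": "Example", "rating": 0, "published": None})
```

Doing this in the backend would leave the general prepare untouched, so
backends that can store nulls would not be affected.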

If this is reasonable it might be worth merging into the project -
seems like a useful update in terms of Solr implementation
completeness.


Hope this helps!
Kevin



kevin

Jul 14, 2009, 10:33:13 AM
to django-haystack
Hmm... on further review it looks like my solution might not have gone
far enough. It works for Date/DateTime. However, due to the type
casting that occurs in the various type-specific SearchField
implementations, it looks like other field types containing null
values get coerced into non-null values and are thus included in the
document. A solution could be to add null checks to these
type-specific implementations. For example:

class IntegerField(SearchField):
    def __init__(self, **kwargs):
        kwargs['default'] = 0
        super(IntegerField, self).__init__(**kwargs)

    def prepare(self, obj):
        value = super(IntegerField, self).prepare(obj)

        # Only coerce real values; pass None through so it can be omitted.
        if value is not None:
            return int(value)
        else:
            return None

Daniel Lindsley

Jul 16, 2009, 12:50:15 AM
to django-...@googlegroups.com
Kevin,


Seems somewhat reasonable at first glance, though I think the
better idea here might be to emulate Django's Model fields and add a
`null=True` kwarg, then decide whether the field is added to the
document based on that. I'll add an issue for this and try to get to it
in the near-ish future.
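
A rough sketch of how such a kwarg could behave, under stated
assumptions: the `SearchField` and `IntegerField` classes below are toy
stand-ins written for this example, and `build_document`, `Record`, and
the field names are hypothetical. None of this is Haystack's actual
implementation.

```python
class SearchField:
    """Toy stand-in for a search field; not Haystack's real class."""

    def __init__(self, model_attr, null=False, default=None):
        self.model_attr = model_attr
        self.null = null
        self.default = default

    def prepare(self, obj):
        value = getattr(obj, self.model_attr, None)
        if value is None and not self.null:
            # Non-nullable fields fall back to their default.
            return self.default
        return value


class IntegerField(SearchField):
    def __init__(self, model_attr, null=False, default=0):
        super().__init__(model_attr, null=null, default=default)

    def prepare(self, obj):
        value = super().prepare(obj)
        # Coerce only real values so None survives for nullable fields.
        return None if value is None else int(value)


def build_document(fields, obj):
    """Prepare every field, leaving out nullable fields that are None."""
    document = {}
    for name, field in fields.items():
        value = field.prepare(obj)
        if value is None and field.null:
            continue
        document[name] = value
    return document


class Record:
    """Hypothetical model instance with two null attributes."""
    pages = None
    votes = None


fields = {
    "pages": IntegerField("pages", null=True),  # omitted when None
    "votes": IntegerField("votes"),             # falls back to default 0
}
document = build_document(fields, Record())
```

With this shape, a nullable field is dropped from the document while a
non-nullable one keeps the existing default behavior.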


Daniel

kevin

Jul 16, 2009, 12:12:43 PM
to django-haystack
Daniel,

Thanks for investigating! I agree with your suggestion about following
the 'null=True' pattern from the model. Regardless, I've patched my
copy with the workaround I posted - so there's no urgency from me.
However, it might be worth noting this limitation in the docs in the
meantime just so others aren't surprised by the behavior.

Thanks again for all your work!
Kevin

Daniel Lindsley

Jul 19, 2009, 3:19:28 AM
to django-...@googlegroups.com
Kevin,


The functionality I think you're looking for has been added to
Haystack. Hope that helps in the future.


Daniel