Some thoughts, both general and Xapian related


Richard Boulton

Apr 13, 2009, 10:46:54 AM
to django-haystack
I've been spending some time today implementing a haystack based search
of an existing site, in order to gain a better understanding of
haystack, and with a view to starting to write a Xapian/Xappy backend.
Here are some thoughts so far, in no particular order:

1. The documentation implies that a Xapian backend would not support
field weight boosting. I don't see any reason that this wouldn't be
possible - Xapian supports boosting of weights on a field, or on an
individual term basis, both statically (ie, at index time) and
dynamically (at search time).

2. I note that the HAYSTACK_DEFAULT_OPERATOR setting defaults to OR.
This seems slightly odd to me, since most search engines default to an
AND operator these days, and correspondingly, users tend to expect more
search terms to result in a more specific query, not a more broad one.
Is this for compatibility with the ORM QuerySet object in some way?

3. The highlight method doesn't allow the manner in which the
highlighting is applied to be customised at all. Xappy's highlighting
supports an "hl" parameter, which is used to specify the strings to wrap
highlighted words in, which seems useful. Even more useful might be to
have a callback of some kind to allow different words to be highlighted
in different ways (eg, different colours).

It's also often useful to be able to highlight some fields in the
result, but just pass through others unchanged - eg, highlight the
"content" field, but pass date fields through. I'm not sure; would that
be possible with haystack currently?

Actually, the highlighting code in Xappy is pure-python, and my company
has full copyright in it. If there's interest, I could remove the xapian
specific code (about 3 lines, for extracting the terms from a xapian
query), get it relicensed appropriately, and it could be included in
haystack as a default implementation for backends which don't have their
own highlighting support. It also includes code to make a
query-specific summary, as well as highlighting.

4. Documentation on how the facet stuff is intended to work would be
really handy! :) Xappy has extensive facet support - I'd like to see
whether that support matches reasonably with the support offered by
haystack.

5. raw_search() is documented as accepting a string. Not all backends
will have a human-and-machine parseable string representation of the
query (eg, Xapian doesn't), so it might be better to just document the
parameters to this function as being backend specific, too.

6. When building up queries using .filter(), do the filters contribute
to the relevance order, or just restrict the set of documents in the
result? I'm assuming the former (ie, they have an effect on the weight
for each document), but I've not been able to confirm this by looking at
the code or documentation yet. I think this is a confusion for me
because of a terminology difference; in Xapian, a filter is something
which is applied to a query to restrict the returned set of documents
_without_ affecting the weight.

7. Is there any support for range searches, currently? Eg, for storing
a numeric value in the database, and performing a search to return only
documents for which that value lies between two values.

8. Is there any support for synonym expansion? There are two things I
was looking for; automatically expanding queries to use synonym terms,
and for adjusting the weight such that multiple terms can be combined to
behave like a single term for weighting purposes.

9. Is there any support for spelling correction?
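
On point 3, here is a minimal sketch of what a backend-independent
highlighter with a configurable wrapper might look like. It is not
Xappy's code: the function names, the `wrap` callback and the `fields`
argument are all hypothetical, and only the "hl" name is borrowed from
the description above.

```python
import re


def highlight(text, terms, hl=("<b>", "</b>"), wrap=None):
    """Wrap each whole-word occurrence of a query term found in `text`.

    hl   -- (start, end) strings placed around each matched word
    wrap -- optional callback(word) -> replacement string; overrides `hl`
    """
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        if wrap is not None:
            return wrap(word)
        return hl[0] + word + hl[1]

    return pattern.sub(replace, text)


def highlight_fields(result, terms, fields=("content",), **kwargs):
    """Highlight only the named fields; pass every other field through."""
    return {
        name: highlight(value, terms, **kwargs) if name in fields else value
        for name, value in result.items()
    }
```

With this shape, per-word colours are just a callback that looks the
word up in a mapping, and date fields are left alone simply by not
listing them in `fields`.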

That's probably enough to be going on with...!

--
Richard

Richard Boulton

Apr 13, 2009, 5:27:02 PM
to django-haystack
To follow up from this, some further thoughts:

I think the set of fields available are a bit restricted. It would be
good if backends could define their own, to correspond with the
features they provide - currently, sites.py builds the schema by
performing a set of isinstance tests on the field object. This would
be better replaced by calling methods on the Field objects to
determine particular properties. For example, instead of:

if isinstance(field_object, DateField) or isinstance(field_object, DateTimeField):
    field_data['type'] = 'date'


do

field_data['type'] = field_object.get_type()

And define get_type to return 'date' for the DateField and
DateTimeField.

Or possibly, more flexible, just have a single call to the
field_object which is passed the field_data, and updates it
appropriately. eg:

field_object.update_field_data(field_data)
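
To make the two alternatives concrete, here is a rough sketch. The
classes below are stand-ins written for illustration, not Haystack's
real field classes; `GeoField`, its `precision` option and
`build_schema` are hypothetical names.

```python
class SearchField:
    """Hypothetical base class; not Haystack's actual field API."""

    field_type = "string"

    def get_type(self):
        return self.field_type

    def update_field_data(self, field_data):
        # Each field only touches its own entry in the schema.
        field_data["type"] = self.get_type()


class DateField(SearchField):
    field_type = "date"


class DateTimeField(DateField):
    pass  # inherits the 'date' type, so no isinstance test is needed


class GeoField(SearchField):
    """A backend-specific field that passes extra options through."""

    field_type = "geo"

    def __init__(self, precision=5):
        self.precision = precision

    def update_field_data(self, field_data):
        super().update_field_data(field_data)
        field_data["precision"] = self.precision


def build_schema(fields):
    """Build {name: field_data} by asking each field to describe itself."""
    schema = {}
    for name, field_object in fields.items():
        field_data = {"field_name": name}
        field_object.update_field_data(field_data)
        schema[name] = field_data
    return schema
```

The schema builder never needs to know which field classes exist, so a
backend could ship its own field types without changes to the builder.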


For MultiValuedField, it looks like the backends expect a comma
separated list of values. It would seem more natural to me for this
to return a sequence (or iterator) of values.


For the SearchResult class, it might be nice if the properties on the
object could be accessed directly through the SearchResult object,
rather than through the .object. accessor, when they don't conflict
with the names of fields retrieved from the search engine. Not sure
about this, but the current approach seems clunky somehow.

Daniel Lindsley

Apr 14, 2009, 1:15:32 AM
to django-...@googlegroups.com
Richard,


First, thank you very much for the feedback. I appreciate it.
You'll find my replies interspersed.


> 1. The documentation implies that a Xapian backend would not support
> field weight boosting. I don't see any reason that this wouldn't be
> possible - Xapian supports boosting of weights on a field, or on an
> individual term basis, both statically (ie, at index time) and
> dynamically (at search time).

Great! I couldn't find any mention of boost in the docs, so I didn't
include it. I'll add this to the documentation.


> 2. I note that the HAYSTACK_DEFAULT_OPERATOR setting defaults to OR.
> This seems slightly odd to me, since most search engines default to an
> AND operator these days, and correspondingly, users tend to expect more
> search terms to result in a more specific query, not a more broad one.
> Is this for compatibility with the ORM QuerySet object in some way?

Sorry, I think this was the result of early testing and it never got
changed back. I'm changing it back to 'AND'. If anyone wants to
override this, it's available as a setting
(http://github.com/toastdriven/django-haystack/blob/5b589d19f810ab7242484b7ce8f96962261ada19/haystack/query.py#L181).

The state of highlighting sucks right now, simply because every
backend seems to do it differently. I'd appreciate a pure Python
solution if the licensing can be worked out and this is on my list of
things to look into/think about soonish.

> 4. Documentation on how the facet stuff is intended to work would be
> really handy! :) Xappy has extensive facet support - I'd like to see
> whether that support matches reasonably with the support offered by
> haystack.

Again, I didn't know Xapian supported facets because I couldn't find
them mentioned. I'll add this to the docs as well.

And I'm still actively working on documentation (one of the reasons I
haven't formally released yet) though I'm not sure I'll hold it up
much longer.


> 5. raw_search() is documented as accepting a string. Not all backends
> will have a human-and-machine parseable string representation of the
> query (eg, Xapian doesn't), so it might be better to just document the
> parameters to this function as being backend specific, too.

Not sure exactly what you mean here. The raw search is intended for
developer use (no user queries here) so the idea was a string that was
extremely specific to your backend.


> 6. When building up queries using .filter(), do the filters contribute
> to the relevance order, or just restrict the set of documents in the
> result? I'm assuming the former (ie, they have an effect on the weight
> for each document), but I've not been able to confirm this by looking at
> the code or documentation yet. I think this is a confusion for me
> because of a terminology difference; in Xapian, a filter is something
> which is applied to a query to restrict the returned set of documents
> _without_ affecting the weight.

Unfortunately, it's the latter, which is consistent with how Django's
ORM works. In addition, it didn't seem like many engines had a way to
manage relevancy through the query. I think I was thinking this could
be managed with boosting but I'm not sure how that fits into your
world view.


> 7. Is there any support for range searches, currently? Eg, for storing
> a numeric value in the database, and performing a search to return only
> documents for which that value lies between two values.

The supported operators are 'exact', 'gt', 'lt', 'gte', 'lte' and
'in'. So yes, there is range support.
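
As a rough illustration of how those operators combine into a range
search, here is a small in-memory model. It assumes the Django-style
`field__lookup` keyword convention; `matches` and `range_search` are
hypothetical helpers, not part of Haystack.

```python
import operator

# Lookup names taken from the list above; the mapping is illustrative.
LOOKUPS = {
    "exact": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
    "in": lambda value, choices: value in choices,
}


def matches(document, **filters):
    """Return True if `document` (a dict) satisfies every filter."""
    for key, expected in filters.items():
        # "price__gte" -> ("price", "gte"); a bare "price" means "exact".
        field, _, lookup = key.partition("__")
        compare = LOOKUPS[lookup or "exact"]
        if field not in document or not compare(document[field], expected):
            return False
    return True


def range_search(documents, field, low, high):
    """Keep documents whose `field` lies in the closed range [low, high]."""
    bounds = {field + "__gte": low, field + "__lte": high}
    return [doc for doc in documents if matches(doc, **bounds)]


docs = [
    {"id": 1, "price": 5},
    {"id": 2, "price": 10},
    {"id": 3, "price": 15},
    {"id": 4},  # no price stored, so it never matches a price filter
]
in_range = range_search(docs, "price", 10, 15)
```

A range is therefore just a 'gte' filter and an 'lte' filter on the
same field, applied together.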


> 8. Is there any support for synonym expansion? There are two things I
> was looking for; automatically expanding queries to use synonym terms,
> and for adjusting the weight such that multiple terms can be combined to
> behave like a single term for weighting purposes.

Sorry, I'm at a loss here. The answer is no, because I'm not really
sure what you're talking about. If you can point me to
documentation/explanation, I'd love to read up on it.


> 9. Is there any support for spelling correction?

Sorry, not in the current version there isn't. Only Whoosh supports
spelling correction among the current backends (Solr & Lucene don't
have it that I've seen).


Thank you very much again and I hope these answers help a little, even
if they're not what you hoped for. Haystack is really meant to be an
80% solution that handles most cases. That's not to say it can't
handle more advanced things but you'd lose the portability and have to
use custom code (inherit and extend in many cases).


Daniel

Daniel Lindsley

Apr 14, 2009, 1:32:59 AM
to django-...@googlegroups.com
Richard,


>> field_data['type'] = field_object.get_type()
>
> And define get_type to return 'date' for the DateField and
> DateTimeField.

I like this idea best. I'll give it more consideration, though these
cases should cover any fields which inherit from existing fields.


> For MultiValuedField, it looks like the backends expect a comma
> separated list of values.  It would seem more natural to me for this
> to return a sequence (or iterator) of values.

From the backends I've worked with so far, you usually need the full
list anyway. I'm not disputing the performance/memory benefits, and
will consider this as well.


> For the SearchResult class, it might be nice if the properties on the
> object could be accessed directly through the SearchResult object,
> rather than through the .object. accessor, when they don't conflict
> with the names of fields retrieved from the search engine.  Not sure
> about this, but the current approach seems clunky somehow.

Good news! The future is already here!
(http://github.com/toastdriven/django-haystack/blob/5b589d19f810ab7242484b7ce8f96962261ada19/haystack/models.py#L23-25)
That will assign any attributes that come back (that don't conflict)
to the result itself. The `.object` lookup is only once you have to
hit the DB for more information.


Daniel

Simon Willison

Apr 14, 2009, 3:21:24 AM
to django-haystack
On Apr 14, 6:15 am, Daniel Lindsley <polarc...@gmail.com> wrote:
> > 9. Is there any support for spelling correction?
>
> Sorry, not in the current version there isn't. Only Whoosh supports
> spelling correction among the current backends (Solr & Lucene don't
> have it that I've seen).

Solr has a SpellCheckComponent:

http://wiki.apache.org/solr/SpellCheckComponent

It's built on top of the Lucene SpellChecker class:

http://wiki.apache.org/jakarta-lucene/SpellChecker

Cheers,

Simon (enthusiastically stalking haystack)

Richard Boulton

Apr 14, 2009, 3:28:37 AM
to django-haystack
On Tue, Apr 14, 2009 at 12:15:32AM -0500, Daniel Lindsley wrote:
> > 5. raw_search() is documented as accepting a string. Not all backends
> > will have a human-and-machine parseable string representation of the
> > query (eg, Xapian doesn't), so it might be better to just document the
> > parameters to this function as being backend specific, too.
>
> Not sure exactly what you mean here. The raw search is intended for
> developer use (no user queries here) so the idea was a string that was
> extremely specific to your backend.

With Xapian, there isn't a string representation for queries which can
represent all the features available (even for developer type use).
So for a Xapian backend, I'd probably want to use something like
raw_search(), but pass it some objects representing the query, rather
than a string.

> > 6. When building up queries using .filter(), do the filters contribute
> > to the relevance order, or just restrict the set of documents in the
> > result? I'm assuming the former (ie, they have an effect on the weight
> > for each document), but I've not been able to confirm this by looking at
> > the code or documentation yet. I think this is a confusion for me
> > because of a terminology difference; in Xapian, a filter is something
> > which is applied to a query to restrict the returned set of documents
> > _without_ affecting the weight.
>
> Unfortunately, it's the latter, which is consistent with how Django's
> ORM works. In addition, it didn't seem like many engines had a way to
> manage relevancy through the query. I think I was thinking this could
> be managed with boosting but I'm not sure how that fits into your
> world view.

Consistency with Django is more important than consistency with
Xapian, here, so that's fine. You can make all filters on a field
into a xapian-style filter by using .boost(field=0), if I understand
how boost is meant to work, anyway (ie, multiplying the weight of
search terms in that field by 0).

> > 7. Is there any support for range searches, currently? Eg, for storing
> > a numeric value in the database, and performing a search to return only
> > documents for which that value lies between two values.
>
> The supported operators are 'exact', 'gt', 'lt', 'gte', 'lte' and
> 'in'. So yes, there is range support.

Aha. Thanks, that makes sense.

> > 8. Is there any support for synonym expansion? There are two things I
> > was looking for; automatically expanding queries to use synonym terms,
> > and for adjusting the weight such that multiple terms can be combined to
> > behave like a single term for weighting purposes.
>
> Sorry, I'm at a loss here. The answer is no, because I'm not really
> sure what you're talking about. If you can point me to
> documentation/explanation, I'd love to read up on it.

The idea is that users often don't type the exact words that are in a
document, but instead use different words which have the same meaning
(or similar enough for our purposes). So, it can be useful to have a
synonym dictionary, and apply it at search time to "expand" the query to
cover the related words.

For example, if I have a database of fabrics, they might have colour
descriptions like "red", "rouge", "ruby". If I search for "red", I'd
like the "rouge" and "ruby" documents to come back too. So, the
developers of the search engine can make this happen by adding "rouge"
and "ruby" as synonyms of "red".

That's the first part I mentioned: "automatically expanding queries to
use synonym terms". Done naively, this would result in a search like:

red OR rouge OR ruby

For this search (as with any OR search), the weight of a result is the
sum of the weights from the terms in the query (adjusted based on the
frequency of each term in the result document).

Now, suppose that "red" is a common term, but "ruby" is a rare term.
This means (with Xapian's weighting algorithm, and most other algorithms
in use) that "ruby" will be considered a more significant term, and get
a higher weight. This means that a document containing the word "ruby"
will get a higher weight than one containing "red" - which seems wrong,
because "red" was exactly the word used in the query. So, what we
actually want to do is combine the three synonyms here into a single
"virtual term", and apply a weight based on the frequency of that
combined "virtual term". In Xapian (though I admit, only with some code
which is currently only available through the xappy branch) you can do:

red SYNONYM rouge SYNONYM ruby

to do this. Actually, I'd normally produce a query like:

red OR (red SYNONYM rouge SYNONYM ruby)

Which will weight a document with an exact match for the search term
"red" higher than one with only a match for a synonym of "red".

In Xapian, you can build up the synonym dictionary by calling
"add_synonym()" on the database, and then pass a flag to the query
parser to enable this behaviour.

Some minimal documentation for this feature in Xapian:
http://xapian.org/docs/synonyms.html
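
To put numbers on the weighting argument, here is a toy model in plain
Python. It uses a bare log(N/df) weight and ignores within-document
frequency, so it is far simpler than Xapian's real weighting scheme;
the function names and the six sample documents are made up for the
illustration.

```python
import math


def idf(doc_freq, total_docs):
    """Toy inverse document frequency; real schemes are more involved."""
    return math.log(total_docs / doc_freq)


def doc_freq(terms, docs):
    """Number of documents containing at least one of `terms`."""
    return sum(1 for doc in docs if any(t in doc for t in terms))


def score_or(doc, terms, docs):
    """OR query: sum the weights of the query terms present in `doc`."""
    return sum(idf(doc_freq([t], docs), len(docs)) for t in terms if t in doc)


def score_synonym(doc, terms, docs):
    """SYNONYM-style query: `terms` act as one virtual term, one weight."""
    if not any(t in doc for t in terms):
        return 0.0
    return idf(doc_freq(terms, docs), len(docs))


# Each document is modelled as a set of words.
docs = [
    {"red", "silk"}, {"red", "cotton"}, {"red", "wool"},
    {"ruby", "velvet"}, {"rouge", "satin"}, {"blue", "denim"},
]
synonyms = ["red", "rouge", "ruby"]
red_doc, ruby_doc = docs[0], docs[3]

# Naive expansion: the rare synonym outranks the word the user typed.
naive_red = score_or(red_doc, synonyms, docs)
naive_ruby = score_or(ruby_doc, synonyms, docs)

# red OR (red SYNONYM rouge SYNONYM ruby): the exact match wins again.
combined_red = score_or(red_doc, ["red"], docs) + score_synonym(red_doc, synonyms, docs)
combined_ruby = score_or(ruby_doc, ["red"], docs) + score_synonym(ruby_doc, synonyms, docs)
```

In this toy collection "ruby" occurs in one document and "red" in
three, so the naive OR query ranks the "ruby" document first, while the
combined query gives both documents the same virtual-term weight and
adds the exact-match weight only to the "red" document.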

> > 9. Is there any support for spelling correction?
>
> Sorry, not in the current version there isn't. Only Whoosh supports
> spelling correction among the current backends (Solr & Lucene don't
> have it that I've seen).
>
>
> Thank you very much again and I hope these answers help a little, even
> if they're not what you hoped for. Haystack is really meant to be an
> 80% solution that handles most cases. That's not to say it can't
> handle more advanced things but you'd lose the portability and have to
> use custom code (inherit and extend in many cases).

That's entirely sensible; as long as the appropriate places to inherit
and extend are available, anyway. I'm hoping that we can also ensure
that when several backends support the same non-core feature we can
implement support compatibly, so we only lose portability when features
aren't shared.

--
Richard

Richard Boulton

Apr 14, 2009, 3:34:57 AM
to django-haystack
On Apr 14, 6:32 am, Daniel Lindsley <polarc...@gmail.com> wrote:
> >> field_data['type'] = field_object.get_type()
> I like this idea best. I'll give it more consideration, though these
> cases should cover any fields which inherit from existing fields.

If you just implement support for setting existing properties, that's
a bit limiting, though. For example, if I want to add a "geo" field
type (for geospatial matching) for a particular backend, and also pass
several parameters from the Field object through to my backend to
control, say, how accurately I want to store positions, I can't do
that unless there's a general callback like the "update_field_data" I
suggested.

> > For MultiValuedField, it looks like the backends expect a comma
> > separated list of values.  It would seem more natural to me for this
> > to return a sequence (or iterator) of values.
>
> From the backends I've worked with so far, you usually need the full
> list anyway. I'm not disputing the performance/memory benefits, and
> will consider this as well.

I was more thinking that it would be nice to be able to have commas in
the items in the list. (Maybe you can already with some kind of
escaping - but even if so, that's horrible!)

> > For the SearchResult class, it might be nice if the properties on the
> > object could be accessed directly through the SearchResult object,
> > rather than through the .object. accessor, when they don't conflict
> > with the names of fields retrieved from the search engine.  Not sure
> > about this, but the current approach seems clunky somehow.
>
> Good news! The future is already here!
> (http://github.com/toastdriven/django-haystack/blob/5b589d19f810ab7242...)
> That will assign any attributes that come back (that don't conflict)
> to the result itself. The `.object` lookup is only once you have to
> hit the DB for more information.

That's only for properties stored in the search engine, as far as I
can see. I meant that I'd like it to hit the DB lookup automatically,
if I accessed a property not stored in the search engine.

Daniel Lindsley

Apr 15, 2009, 12:30:37 AM
to django-...@googlegroups.com
Simon,


I stand corrected. I'll look into this more this weekend and think
about an API, since people seem interested.


Daniel

Daniel Lindsley

Apr 15, 2009, 1:21:38 AM
to django-...@googlegroups.com
Richard,


On the topic of `raw_search`, I think making what it accepts
ambiguous may be fine but I really want to think on it some more.
While I understand the use case, I'm wary that things might get
out of hand if anything and everything is allowed in. Maybe that
concern is unwarranted, but I want to make sure.

Thanks for the explanation on synonym expansion. Again, this is a
feature I haven't seen in other places, so maybe this is a good use
case for an advanced, sub-classed SearchQuerySet. This is an idea I've
been kicking around for a while, wondering if perhaps a split between a
basic (core functionality everything provides) and advanced (spotty
but powerful support, here there be dragons) SearchQuerySet might be a
good idea. I'm not sold one way or another because there are benefits
and detriments to be had either way.


>> >> field_data['type'] = field_object.get_type()
>> I like this idea best. I'll give it more consideration, though these
>> cases should cover any fields which inherit from existing fields.
>
> If you just implement support for setting existing properties, that's
> a bit limiting, though.  For example, if I want to add a "geo" field
> type (for geospatial matching) for a particular backend, and also pass
> several parameters from the Field object through to my backend to
> control, say, how accurately I want to store positions, I can't do
> that unless there's a general callback like the "update_field_data" I
> suggested.

A single callback that modifies all of the field data scares me. I
understand there's potential power there, but there's also a huge
chance of error or malicious use. Further, the order fields are
processed in is not guaranteed (based on a standard dict), so there's
unknowns there too. However, I agree the existing solution is less
than wonderful and that this will require more thought.

One possibility is that a `SearchSite` may maintain a mapping of
fields/representations that allows extension to convert formats. I'm
not in love with this (big maintenance overhead and more difficult to
extend) but it's a thought.

Finally, another thing that concerns me here is what additional types
are really missing. For example, how is this geo data sent/stored by
the backend? Does that data not fit into a
text/int/bool/float/date/multi bucket? This is also a potential
problem in that we might start getting into formats that not all
backends support.

As an aside, I'm having trouble finding much of anything that
demonstrates how a Xapian schema is built. None of the docs, nor code
samples from the docs, nor the first 10 pages of Google are yielding a
whole lot for me. At the risk of looking dense, a separate e-mail
thread about this would be appreciated.


>> > For MultiValuedField, it looks like the backends expect a comma
>> > separated list of values.  It would seem more natural to me for this
>> > to return a sequence (or iterator) of values.
>>
>> From the backends I've worked with so far, you usually need the full
>> list anyway. I'm not disputing the performance/memory benefits, and
>> will consider this as well.
>
> I was more thinking that it would be nice to be able to have commas in
> the items in the list.  (Maybe you can already with some kind of
> escaping - but even if so, that's horrible!)

I'm sorry but I need to see sample code to understand this one. The
MultiValuedField is simply meant to handle list (or list-like) data,
with no preference on providing a full list versus iteration. The
`prepare` method on that field converts to a list, which removes the
benefits of an iterator, but you could easily create your own field
that doesn't do that. And I'm not aware of any part of this that does
anything with commas at all. You usually hand it a Python list, and
whatever binding the `SearchBackend` uses usually handles the
conversion to what the backend expects. Forgive me if I'm
misunderstanding and I'd appreciate any clarification you may have.


> That's only for properties stored in the search engine, as far as I
> can see.  I meant that I'd like it to hit the DB lookup automatically,
> if I accessed a property not stored in the search engine.

Sorry, this is a wontfix to me. Automatically hitting the DB causes an
ORM `get` for each item (because it could be anything). This is
behavior that some people wouldn't want and there's a big chance of
collisions and unexpected behavior. This isn't necessarily a final
say; simply that I'm strongly -1 on this behavior.


Finally, I'm sorry that my answers to so many of these are "need
further thought". API design is difficult and making sure the code
stands up to what's demanded of it in the future is equally difficult.


Daniel

Richard Boulton

Apr 15, 2009, 3:08:35 AM
to django-haystack
On Wed, Apr 15, 2009 at 12:21:38AM -0500, Daniel Lindsley wrote:
> A single callback that modifies all of the field data scares me. I
> understand there's potential power there, but there's also a huge
> chance of error or malicious use. Further, the order fields are
> processed in is not guaranteed (based on a standard dict), so there's
> unknowns there too. However, I agree the existing solution is less
> than wonderful and that this will require more thought.

I don't think the order in which fields are processed being
unspecified is a problem - the callback would only set data for that
particular field, so it doesn't matter what order the fields are
processed in.

I do understand that this could get out of hand; I was thinking that
new field types would only be implemented by backends, in conjunction
with the support for handling that field type in the backend. I
suppose it's possible that user code might define its own field
types, though.

I don't understand what you mean by "malicious" use, though.

As an aside; I've just noticed that solr_backend.py doesn't use
build_unified_schema(). Does it look at the field contents directly,
or does it just use a pre-defined schema for now?

> One possibility is that a `SearchSite` may maintain a mapping of
> fields/representations that allows extension to convert formats. I'm
> not in love with this (big maintenance overhead and more difficult to
> extend) but it's a thought.
>
> Finally, another thing that concerns me here is what additional types
> are really missing. For example, how is this geo data sent/stored by
> the backend? Does that data not fit into a
> text/int/bool/float/date/multi bucket? This is also a potential
> problem in that we might start getting into formats that not all
> backends support.

Well, pretty much any data could be put into a text representation.
The field types are more to tell the backend how to handle the data;
for example, for a geo type, Xapian has special handling to index it
as a pair of coordinates, and also to generate a hierarchical set of
terms which can be used to quickly filter a search to find candidate
documents within a given range.

I think that what's missing isn't so much "fundamental" types,
(string, int, float, etc), but a way to pass general information about
how to index a field to the backend. For example, for a "float"
field, with Xapian, I'd want to be able to tell the backend whether I
want to be able to perform a range search with this field (and if so,
which ranges are commonly searched, so that Xapian can
index appropriately), whether I want to be able to sort results with
this field, whether I want to be able to do exact matching with this
field, and whether I want to be able to use the field contents as a
weight boost of documents. All of these will be indexed differently
by Xapian, and while we could make the backend perform all of the
actions at once, this would lead to unfeasibly large databases. Other
backends might not support all of these options (eg, with whoosh, I
think it's currently hard to sort on a float value). Passing such
options would be necessarily backend specific, but the backend
specific code written by users would be restricted to the field types
used, and the parameters passed to them.

This support certainly would start getting us into formats that not
all backends support. I suppose the decision is whether haystack should
aim to support only a "lowest common denominator" set of features and
force users to use something else to get at backend specific features,
or whether it should allow backends to provide their own features, at
the expense of "non-core" features not being backend portable.

My vote is for the latter, but with a strong push to "standardise"
features into the core set as much as possible (ie, to try and
implement support for new features in such a way that it's possible
to implement them for more than one backend, even if only one backend
supports them initially).

While we're on the subject; having multi as a separate type struck me
as odd - it seems more natural for it to be a property which can be
applied to any of the existing types. ie; allow
CharField(multi=True), or FloatField(multi=True). Maybe I'm
misunderstanding how MultiValueField is meant to work, though.

> > I was more thinking that it would be nice to be able to have commas in
> > the items in the list.  (Maybe you can already with some kind of
> > escaping - but even if so, that's horrible!)
>
> I'm sorry but I need to see sample code to understand this one. The
> MultiValuedField is simply meant to handle list (or list-like) data,
> with no preference on providing a full list versus iteration. The
> `prepare` method on that field converts to a list, which removes the
> benefits of an iterator, but you could easily create your own field
> that doesn't do that. And I'm not aware of any part of this that does
> anything with commas at all. You usually hand it a Python list, and
> whatever binding the `SearchBackend` uses usually handles the
> conversion to what the backend expects. Forgive me if I'm
> misunderstanding and I'd appreciate any clarification you may have.

Perhaps I'm being confused. I've not actually used MultiValueField
yet (though I have a couple of places where I'd like to). I was
looking at these lines in whoosh_backend.py:

if field['multi_valued'] is True:
    schema_fields[field['field_name']] = KEYWORD(stored=True, comma=True)

And drew the conclusion that this meant that the field was passed to
the backend as a comma separated list. Perhaps I was mistaken; I've
not quite worked out how this works, yet.
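
To show the concern about commas in isolation, here is a small sketch
of what happens if a list of values is ever flattened into a comma
separated string and split again. Whether any backend actually does
this is exactly the open question above; the two helper functions are
hypothetical.

```python
def to_comma_string(values):
    """Join values the way a comma-separated keyword field might."""
    return ",".join(values)


def from_comma_string(text):
    """Split a comma-separated string back into individual values."""
    return text.split(",")


tags = ["red, dark", "blue"]
round_tripped = from_comma_string(to_comma_string(tags))

# The embedded comma splits one value into two, so the round trip is
# lossy; passing the values through as a sequence avoids the problem.
as_sequence = list(iter(tags))
```

Two values go in and three come out, which is why a sequence (or
iterator) of values is the safer interface between field and backend.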

> > That's only for properties stored in the search engine, as far as I
> > can see.  I meant that I'd like it to hit the DB lookup automatically,
> > if I accessed a property not stored in the search engine.
>
> Sorry, this is a wontfix to me. Automatically hitting the DB causes an
> ORM `get` for each item (because it could be anything). This is
> behavior that some people wouldn't want and there's a big chance of
> collisions and unexpected behavior. This isn't necessarily a final
> say; simply that I'm strongly -1 on this behavior.

Fair enough. The lookup would only happen if one of the fields which
was not stored in the search engine was accessed, but you're right
that this could lead to unpredictable behaviour, and is probably a
bit too much magic.

However, I thought that this is what the load_all() method was for -
to ensure that only one DB hit (for each model) happened.
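
The difference in cost can be sketched with a fake table that counts
queries. This is not Django's ORM or Haystack's implementation;
`FakeTable` and both loader functions are made up, and the sketch
assumes all results belong to a single model.

```python
class FakeTable:
    """Stand-in for one model's table that counts queries issued."""

    def __init__(self, rows):
        self.rows = rows  # {pk: record}
        self.queries = 0

    def get(self, pk):
        self.queries += 1
        return self.rows[pk]

    def in_bulk(self, pks):
        self.queries += 1
        return {pk: self.rows[pk] for pk in pks}


def load_one_by_one(result_pks, table):
    """Lazy per-result loading: one query for every search result."""
    return [table.get(pk) for pk in result_pks]


def load_all(result_pks, table):
    """One bulk query, then reassemble in search-result order."""
    records = table.in_bulk(result_pks)
    return [records[pk] for pk in result_pks]


rows = {1: "a", 2: "b", 3: "c"}
lazy_table, bulk_table = FakeTable(rows), FakeTable(rows)
lazy = load_one_by_one([3, 1, 2], lazy_table)
bulk = load_all([3, 1, 2], bulk_table)
```

Both loaders return the records in relevance order, but the lazy one
issues a query per result while the bulk one issues a single query.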

While I'm looking at that bit of code (SearchResult.__init__):

for key, value in kwargs.items():
    if not key in self.__dict__:
        self.__dict__[key] = value

It looks like it's currently impossible to get at stored fields in the
search engine whose names conflict with any of the members of
SearchResult; eg "object", "model", "content_type", "score". Some of
these certainly are likely to be of use - we need to document this
restriction (unless I'm missing something), and/or provide a way to
access fields which conflict with members of SearchResult directly.
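
One possible way to provide that access, sketched on a simplified
stand-in class. The constructor signature and the `stored_fields`
attribute are hypothetical and not taken from Haystack.

```python
class SearchResult:
    """Simplified stand-in for the class discussed above."""

    def __init__(self, model_name, pk, relevance, **kwargs):
        self.model = model_name
        self.pk = pk
        self.score = relevance
        # Hypothetical addition: keep every stored field, whether or
        # not its name conflicts with a member of this class.
        self.stored_fields = dict(kwargs)
        for key, value in kwargs.items():
            # Same rule as the quoted loop: never overwrite a member.
            if key not in self.__dict__:
                self.__dict__[key] = value


result = SearchResult("notes.note", 1, 0.87, title="Hello", score=5)
```

Here `result.title` works as before, `result.score` is still the
relevance score, and the stored field that was shadowed remains
reachable as `result.stored_fields["score"]`.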

> Finally, I'm sorry that my answers to so many of these are "need
> further thought". API design is difficult and making sure the code
> stands up to what's demanded of it in the future is equally difficult.

Sure - in your defence, I'm picking on the bits which either don't
seem right to me, or are not obvious. I'm ignoring the bulk of the
code which looks fine. :)

--
Richard