First, thank you very much for the feedback. I appreciate it.
You'll find my replies interspersed.
> 1. The documentation implies that a Xapian backend would not support
> field weight boosting. I don't see any reason that this wouldn't be
> possible - Xapian supports boosting of weights on a field, or on an
> individual term basis, both statically (ie, at index time) and
> dynamically (at search time).
Great! I couldn't find any mention of boost in the docs, so I didn't
include it. I'll add this to the documentation.
> 2. I note that the HAYSTACK_DEFAULT_OPERATOR setting defaults to OR.
> This seems slightly odd to me, since most search engines default to
> an AND operator these days, and correspondingly, users tend to expect
> more search terms to result in a more specific query, not a more
> broad one. Is this for compatibility with the ORM QuerySet object in
> some way?
Sorry, I think that was the result of early testing and it never got
changed back. I'm changing it back to 'AND'. If anyone wants to
override this, it's available as a setting
(http://github.com/toastdriven/django-haystack/blob/5b589d19f810ab7242484b7ce8f96962261ada19/haystack/query.py#L181).
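For anyone wondering what the override amounts to, here is a minimal sketch. Only the setting name `HAYSTACK_DEFAULT_OPERATOR` comes from this thread; `FakeSettings`, `OrSettings` and `join_terms` are hypothetical stand-ins for illustration, not Haystack code.

```python
# Hypothetical stand-in for a Django settings module. Only the name
# HAYSTACK_DEFAULT_OPERATOR is taken from this thread.
class FakeSettings:
    HAYSTACK_DEFAULT_OPERATOR = "AND"


class OrSettings:
    HAYSTACK_DEFAULT_OPERATOR = "OR"


def join_terms(terms, settings=FakeSettings):
    """Join search terms with the configured operator (illustrative helper)."""
    operator = getattr(settings, "HAYSTACK_DEFAULT_OPERATOR", "AND")
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return (" %s " % operator).join(terms)


print(join_terms(["django", "search"]))              # narrower query
print(join_terms(["django", "search"], OrSettings))  # broader query
```

With 'AND' as the default, adding terms narrows the query; a project that prefers the broader behaviour would set the value to 'OR' in its own settings.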
The state of highlighting sucks right now, simply because every
backend seems to do it differently. I'd appreciate a pure Python
solution if the licensing can be worked out and this is on my list of
things to look into/think about soonish.
> 4. Documentation on how the facet stuff is intended to work would be
> really handy! :) Xappy has extensive facet support - I'd like to see
> whether that support matches reasonably with the support offered by
> haystack.
Again, I didn't know Xapian supported faceting because I couldn't find
it mentioned. I'll add this to the docs as well.
And I'm still actively working on documentation (one of the reasons I
haven't formally released yet) though I'm not sure I'll hold it up
much longer.
> 5. raw_search() is documented as accepting a string. Not all backends
> will have a human-and-machine parseable string representation of the
> query (eg, Xapian doesn't), so it might be better to just document the
> parameters to this function as being backend specific, too.
Not sure exactly what you mean here. The raw search is intended for
developer use (no user queries here) so the idea was a string that was
extremely specific to your backend.
> 6. When building up queries using .filter(), do the filters contribute
> to the relevance order, or just restrict the set of documents in the
> result? I'm assuming the former (ie, they have an effect on the weight
> for each document), but I've not been able to confirm this by looking
> at the code or documentation yet. I think this is a confusion for me
> because of a terminology difference; in Xapian, a filter is something
> which is applied to a query to restrict the returned set of documents
> _without_ affecting the weight.
Unfortunately, it's the latter: filters only restrict the set of
documents, which is consistent with how Django's ORM works. In
addition, it didn't seem like many engines had a way to manage
relevancy through the query. My thinking was that this could be
managed with boosting, but I'm not sure how that fits into your
world view.
> 7. Is there any support for range searches, currently? Eg, for
> storing a numeric value in the database, and performing a search to
> return only documents for which that value lies between two values.
The supported operators are 'exact', 'gt', 'lt', 'gte', 'lte' and
'in'. So yes, there is range support.
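To show how those operators combine into a range, here is a toy, in-memory sketch. The operator names are the ones listed above; the `field__operator` keyword convention is assumed to mirror Django's ORM, and `OPERATORS`, `split_lookup` and `matches` are hypothetical helpers rather than Haystack internals.

```python
import operator

# Operator names come from the list above; the implementation is a toy.
OPERATORS = {
    "exact": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
    "in": lambda value, choices: value in choices,
}


def split_lookup(expression):
    """Split 'price__gte' into ('price', 'gte'); default to 'exact'."""
    field, sep, op = expression.rpartition("__")
    if sep and op in OPERATORS:
        return field, op
    return expression, "exact"


def matches(document, **lookups):
    """Return True when the document satisfies every lookup."""
    for expression, expected in lookups.items():
        field, op = split_lookup(expression)
        if not OPERATORS[op](document[field], expected):
            return False
    return True


docs = [{"id": 1, "price": 5}, {"id": 2, "price": 15}, {"id": 3, "price": 25}]

# A range search is just a lower bound plus an upper bound.
in_range = [d["id"] for d in docs if matches(d, price__gte=10, price__lte=20)]
print(in_range)
```

The point is only that a 'gte' lookup and an 'lte' lookup on the same field express "between two values"; how each backend translates that into its own query syntax is a separate matter.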
> 8. Is there any support for synonym expansion? There are two things I
> was looking for; automatically expanding queries to use synonym terms,
> and for adjusting the weight such that multiple terms can be combined
> to behave like a single term for weighting purposes.
Sorry, I'm at a loss here. The answer is no, because I'm not really
sure what you're talking about. If you can point me to
documentation/explanation, I'd love to read up on it.
> 9. Is there any support for spelling correction?
Sorry, not in the current version. Only Whoosh supports spelling
correction among the current backends (Solr & Lucene don't have it
that I've seen).
Thank you very much again and I hope these answers help a little, even
if they're not what you hoped for. Haystack is really meant to be an
80% solution that handles most cases. That's not to say it can't
handle more advanced things but you'd lose the portability and have to
use custom code (inherit and extend in many cases).
Daniel
>> field_data['type'] = field_object.get_type()
>
> And define get_type to return 'date' for the DateField and
> DateTimeField.
I like this idea best. I'll give it more consideration, though these
cases should cover any fields which inherit from existing fields.
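As a rough sketch of the `get_type` idea, assuming each field class reports its own type string: the class names below are simplified stand-ins, `PublishedField` and `build_field_data` are hypothetical, and none of this is the actual Haystack implementation.

```python
class CharField:
    def get_type(self):
        return "text"


class DateField:
    def get_type(self):
        return "date"


class DateTimeField(DateField):
    # Inherits get_type, so it also reports 'date'.
    pass


class PublishedField(DateTimeField):
    # A hypothetical user-defined subclass: no extra work needed.
    pass


def build_field_data(name, field_object):
    """Build the per-field dict a backend might receive (illustrative)."""
    field_data = {"field_name": name}
    field_data["type"] = field_object.get_type()
    return field_data


print(build_field_data("title", CharField()))
print(build_field_data("pub_date", PublishedField()))
```

Because the lookup goes through normal method resolution, any field that inherits from an existing one picks up the right type without the backend knowing about it.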
> For MultiValuedField, it looks like the backends expect a comma
> separated list of values. It would seem more natural to me for this
> to return a sequence (or iterator) of values.
From the backends I've worked with so far, you usually need the full
list anyway. I'm not disputing the performance/memory benefits, and
will consider this as well.
> For the SearchResult class, it might be nice if the properties on the
> object could be accessed directly through the SearchResult object,
> rather than through the .object. accessor, when they don't conflict
> with the names of fields retrieved from the search engine. Not sure
> about this, but the current approach seems clunky somehow.
Good news! The future is already here!
(http://github.com/toastdriven/django-haystack/blob/5b589d19f810ab7242484b7ce8f96962261ada19/haystack/models.py#L23-25)
That will assign any attributes that come back (that don't conflict)
to the result itself. The `.object` lookup is only needed once you
have to hit the DB for more information.
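In spirit, the behaviour is something like the sketch below. `SketchResult` and its constructor arguments are hypothetical, written for illustration; the linked code is the real reference.

```python
class SketchResult:
    """Toy result object: stored fields become attributes unless they clash."""

    def __init__(self, app_label, model_name, pk, score, stored_fields):
        self.app_label = app_label
        self.model_name = model_name
        self.pk = pk
        self.score = score
        for key, value in stored_fields.items():
            # Skip anything that would overwrite an existing attribute.
            if not hasattr(self, key):
                setattr(self, key, value)


result = SketchResult(
    "blog", "entry", 7, 0.42,
    {"title": "Hello", "pk": "shadow"},  # 'pk' conflicts and is ignored
)
print(result.title, result.pk)
```

Here `title` is readable straight off the result, while the conflicting `pk` from the search engine does not clobber the result's own `pk`.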
Daniel
On the topic of `raw_search`, I think making what it accepts
ambiguous may be fine but I really want to think on it some more.
While I understand the use case, I'm wary that things might get
out of hand if anything and everything is allowed in. Maybe that
concern is unwarranted, but I want to make sure.
Thanks for the explanation on synonym expansion. Again, this is a
feature I haven't seen in other places, so maybe this is a good use
case for an advanced, sub-classed SearchQuerySet. This is an idea I've
been kicking around for a while, wondering if perhaps a split between a
basic (core functionality everything provides) and advanced (spotty
but powerful support, here there be dragons) SearchQuerySet might be a
good idea. I'm not sold one way or another because there are benefits
and detriments to be had either way.
>> >> field_data['type'] = field_object.get_type()
>> I like this idea best. I'll give it more consideration, though these
>> cases should cover any fields which inherit from existing fields.
>
> If you just implement support for setting existing properties, that's
> a bit limiting, though. For example, if I want to add a "geo" field
> type (for geospatial matching) for a particular backend, and also pass
> several parameters from the Field object through to my backend to
> control, say, how accurately I want to store positions, I can't do
> that unless there's a general callback like the "update_field_data" I
> suggested.
A single callback that modifies all of the field data scares me. I
understand there's potential power there, but there's also a huge
chance of error or malicious use. Further, the order in which fields
are processed is not guaranteed (based on a standard dict), so there
are unknowns there too. However, I agree the existing solution is less
than wonderful and that this will require more thought.
One possibility is that a `SearchSite` may maintain a mapping of
fields/representations that allows extension to convert formats. I'm
not in love with this (big maintenance overhead and more difficult to
extend) but it's a thought.
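To make the mapping idea a little more concrete, here is one possible shape for it. Everything here is hypothetical (`SketchSite`, `register_field_type`, `type_for`, and the field classes); it is a thought experiment, not a proposal for the real `SearchSite` API.

```python
class SketchSite:
    """Toy registry mapping field classes to backend type names."""

    def __init__(self):
        self._type_map = {}

    def register_field_type(self, field_class, type_name):
        self._type_map[field_class] = type_name

    def type_for(self, field_object):
        # Walk the class hierarchy so subclasses reuse a parent's mapping.
        for klass in type(field_object).__mro__:
            if klass in self._type_map:
                return self._type_map[klass]
        raise LookupError("no type registered for %r" % type(field_object))


class CharField:
    pass


class GeoField:
    pass


class SlugField(CharField):
    pass


site = SketchSite()
site.register_field_type(CharField, "text")
site.register_field_type(GeoField, "geo")  # an extension adds its own type

print(site.type_for(SlugField()), site.type_for(GeoField()))
```

The maintenance cost mentioned above shows up here too: every new field type needs a registration call, and a backend still has to understand whatever type name comes out.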
Finally, another thing that concerns me here is what additional types
are really missing. For example, how is this geo data sent/stored by
the backend? Does that data not fit into a
text/int/bool/float/date/multi bucket? This is also a potential
problem in that we might start getting into formats that not all
backends support.
As an aside, I'm having trouble finding much of anything that
demonstrates how a Xapian schema is built. None of the docs, nor code
samples from the docs, nor the first 10 pages of Google are yielding a
whole lot for me. At the risk of looking dense, a separate e-mail
thread about this would be appreciated.
>> > For MultiValuedField, it looks like the backends expect a comma
>> > separated list of values. It would seem more natural to me for this
>> > to return a sequence (or iterator) of values.
>>
>> From the backends I've worked with so far, you usually need the full
>> list anyway. I'm not disputing the performance/memory benefits, and
>> will consider this as well.
>
> I was more thinking that it would be nice to be able to have commas in
> the items in the list. (Maybe you can already with some kind of
> escaping - but even if so, that's horrible!)
I'm sorry but I need to see sample code to understand this one. The
MultiValuedField is simply meant to handle list (or list-like) data,
with no preference on providing a full list versus iteration. The
`prepare` method on that field converts to a list, which removes the
benefits of an iterator, but you could easily create your own field
that doesn't do that. And I'm not aware of any part of this that does
anything with commas at all. You usually hand it a Python list, and
whatever binding the `SearchBackend` uses usually handles the
conversion to what the backend expects. Forgive me if I'm
misunderstanding and I'd appreciate any clarification you may have.
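For reference, this is roughly the behaviour being described, as a sketch. Both classes are hypothetical stand-ins: the first mimics a `prepare` that converts to a list, the second is the kind of custom field that keeps iteration lazy.

```python
class SketchMultiValueField:
    """Toy field whose prepare() always hands the backend a full list."""

    def prepare(self, values):
        return list(values)


class LazyMultiValueField(SketchMultiValueField):
    """Custom variant that keeps the data as an iterator."""

    def prepare(self, values):
        return iter(values)


def locations():
    # Items may contain commas; nothing here splits or joins on them.
    yield "Portland, OR"
    yield "Austin, TX"


as_list = SketchMultiValueField().prepare(locations())
as_iter = LazyMultiValueField().prepare(locations())

print(as_list)
print(next(as_iter))
```

Since the values travel as separate list items rather than one delimited string, commas inside an item are not a problem at this layer; what the backend binding does afterwards is up to that binding.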
> That's only for properties stored in the search engine, as far as I
> can see. I meant that I'd like it to hit the DB lookup automatically,
> if I accessed a property not stored in the search engine.
Sorry, this is a wontfix to me. Automatically hitting the DB causes an
ORM `get` for each item (because it could be anything). This is
behavior that some people wouldn't want and there's a big chance of
collisions and unexpected behavior. This isn't necessarily a final
say; simply that I'm strongly -1 on this behavior.
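To illustrate the cost, here is a small simulation. `FakeDatabase` and `AutoLoadingResult` are hypothetical; the fake database only counts how many `get` calls an automatic fallback would trigger.

```python
class FakeDatabase:
    """Hypothetical stand-in for the ORM; counts every get()."""

    def __init__(self, rows):
        self.rows = rows
        self.get_calls = 0

    def get(self, pk):
        self.get_calls += 1
        return self.rows[pk]


class AutoLoadingResult:
    """Toy result that silently falls back to the database."""

    def __init__(self, db, pk, stored):
        self._db = db
        self._pk = pk
        self._stored = stored
        self._row = None

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._stored:
            return self._stored[name]
        if self._row is None:
            self._row = self._db.get(self._pk)  # one query per result
        try:
            return self._row[name]
        except KeyError:
            raise AttributeError(name) from None


db = FakeDatabase({1: {"author": "ann"}, 2: {"author": "bob"}, 3: {"author": "cy"}})
results = [AutoLoadingResult(db, pk, {"title": "Doc %d" % pk}) for pk in (1, 2, 3)]

titles = [r.title for r in results]    # served from stored fields
calls_after_titles = db.get_calls
authors = [r.author for r in results]  # each access triggers a get()
calls_after_authors = db.get_calls
print(calls_after_titles, calls_after_authors)
```

Reading a stored field costs nothing, but touching one unstored attribute across a page of results quietly issues one lookup per result, which is the behaviour I'd rather keep explicit.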
Finally, I'm sorry that my answers to so many of these are "need
further thought". API design is difficult and making sure the code
stands up to what's demanded of it in the future is equally difficult.
Daniel