Indexing multiple values for a single field?

Steve Howe

Mar 23, 2011, 4:20:49 AM

to who...@googlegroups.com

Hello,

Is there a way to index multiple values for a single field?
For instance, let's say I'm indexing a document with a structure like this:

{
    'id': 1,
    'name': 'Steve Howe',
    'emails': [
        {'email': 'xx...@xxx.com'},
        {'email': 'xx...@xxx.com'},
        {'email': 'xx...@xxx.com'},
        {'email': 'xx...@xxx.com'},
    ],
    'text': ''
}

Is there a way to index all values of "email" without converting it to
TEXT ? I want to preserve functionality of querying by the email
field; for instance, querying one of these expressions:

email: xx...@xxx.com
email: xx...@xxx.com

... would match the document above.

In Xapian that would be possible by calling document.add_value()
multiple times for each email value, but since the Whoosh API expects
keyword arguments, such as:

writer.add_document(**fields)

... I don't know if it's possible at all.

It would be very useful if the writer.add_document() API accepted
arrays for covering this case, such as:

schema = Schema(
    id=ID(stored=True),
    name=TEXT(),
    emails=ID(),
    text=TEXT()
)

#...
w = ix.writer()
w.add_document(id=1, name='Steve Howe', emails=['xx...@xxx.com',
    'xx...@xxx.com', 'xx...@xxx.com', 'xx...@xxx.com'])

Thanks !
--
Steve Howe
howe...@gmail.com

Mehdy DREF

Mar 23, 2011, 9:48:39 AM

to who...@googlegroups.com, Steve Howe

Hi, it's probably possible to use the "whoosh.fields.KEYWORD" in
your schema.

whoosh.fields.KEYWORD
This type is designed for space- or comma-separated keywords. This
type is indexed and searchable (and optionally stored). To save space,
it does not support phrase searching.

m.


--
Mehdy DREF
Chef de Projet / Développeur Multimédia Indépendant
http://www.e-magina.fr/
mehdy...@gmail.com

Matt Chaput

Mar 23, 2011, 9:50:51 AM

to who...@googlegroups.com

> Is there a way to index multiple values for a single field?
> For instance, let's say I'm indexing a document with a structure like this:
>
> {
>     'id': 1,
>     'name': 'Steve Howe',
>     'emails': [
>         {'email': 'xx...@xxx.com'},
>         {'email': 'xx...@xxx.com'},
>         {'email': 'xx...@xxx.com'},
>         {'email': 'xx...@xxx.com'},
>     ],
>     'text': ''
> }
>
> Is there a way to index all values of "email" without converting it to
> TEXT ? I want to preserve functionality of querying by the email
> field; for instance, querying one of these expressions:
>
> email: xx...@xxx.com
> email: xx...@xxx.com
>
> ... would match the document above.

Not really... but converting to text isn't really that bad...
(untested code):

from whoosh.fields import Schema, ID, TEXT, KEYWORD

# Make the emails field a KEYWORD field so it treats input strings as a
# space-separated (optionally comma-separated) list of opaque terms

schema = Schema(id=ID(stored=True), name=TEXT(stored=True),
                emails=KEYWORD(stored=True), text=TEXT)

# ...

emails = ['xx...@xxx.com', 'xx...@xxx.com', 'xx...@xxx.com', 'xx...@xxx.com']
# ID fields expect unicode strings, so the id is passed as u"1" rather than 1
writer.add_document(id=u"1", name=u"Steve Howe", text=u'',
                    emails=u" ".join(emails),
                    _stored_emails=emails)  # Store as a Python list

Early versions of Whoosh did let you pass sequences, but I decided that it overcomplicated the indexing code for limited usefulness.
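
A minimal sketch of the joining step for a KEYWORD field, assuming each value must stay a single opaque term. The helper name join_keywords and the example addresses are made up for illustration; the check rejects values containing whitespace or commas, since those would be split into several terms.

```python
import re


def join_keywords(values):
    """Join opaque values into one space-separated string.

    Hypothetical helper: rejects any value that a space- or
    comma-splitting field would break into more than one term.
    """
    cleaned = []
    for value in values:
        value = value.strip()
        if not value or re.search(r"[\s,]", value):
            raise ValueError("not usable as a single keyword: %r" % (value,))
        cleaned.append(value)
    return " ".join(cleaned)


# Placeholder addresses, invented for this example.
emails = ["steve@example.com", "howe@example.org"]
joined = join_keywords(emails)
print(joined)
```

The joined string would then be passed as emails=joined to writer.add_document(), with _stored_emails=emails if the original list should be kept as the stored value.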

Cheers,

Matt

Steve Howe

Mar 23, 2011, 1:18:50 PM

to who...@googlegroups.com

Thanks, Matt.

That would solve the case for emails, but the schema is really more
complicated (I had simplified it for writing the email): I have to
store multiple dates for each record, and even multiple comments
(which would contain spaces), so KEYWORD would not really be useful.
I'd lose all ID functionality: range search (and even sorting) on
dates, integers, etc.

That's very important for those who (like me) lack search
functionality on the backend, as is the case with key-value stores. I
think I won't be able to use Whoosh because of this :(

Nested schemas (i.e. a Schema as a field of another Schema) would also
be very nice, but I guess that's out of scope as well...
--
Howe
howe...@gmail.com

Mehdy DREF

Mar 23, 2011, 1:28:52 PM

to who...@googlegroups.com, Steve Howe

I don't know if this is a good idea, but I suggest using several
indexes for your use case.

For example, you can create an email index with the fields (id, email)
and a comment index with the fields (id, date, content).

Each record can be stored by id in a simple pickle, for example.
Then, when you search, you can match the id of each Whoosh result
to the id in the pickle.
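
A rough sketch of that layout in plain Python, assuming string ids and a dict keyed by id as the pickled store. The record contents and the helpers email_rows, comment_rows and resolve are hypothetical; each row is the set of keyword arguments that would go to add_document() on the corresponding index.

```python
import pickle
from datetime import datetime

# Hypothetical record, shaped like the one in the original question.
record = {
    "id": "1",
    "name": "Steve Howe",
    "emails": ["steve@example.com", "howe@example.org"],
    "comments": [
        {"date": datetime(2011, 3, 23), "content": "first comment"},
        {"date": datetime(2011, 3, 24), "content": "second comment"},
    ],
}


def email_rows(rec):
    """One (id, email) row per address, for the email index."""
    return [{"id": rec["id"], "email": e} for e in rec["emails"]]


def comment_rows(rec):
    """One (id, date, content) row per comment, for the comment index."""
    return [
        {"id": rec["id"], "date": c["date"], "content": c["content"]}
        for c in rec["comments"]
    ]


# The full records live outside the indexes, here round-tripped through pickle.
store = pickle.loads(pickle.dumps({record["id"]: record}))


def resolve(hit_ids, store):
    """Map ids taken from search hits back to full records.

    Several rows can share one id, so duplicates are dropped
    while the order of first appearance is kept.
    """
    seen = set()
    found = []
    for doc_id in hit_ids:
        if doc_id in store and doc_id not in seen:
            seen.add(doc_id)
            found.append(store[doc_id])
    return found
```

Because every row carries the parent id, a hit in either index leads back to the same record through resolve().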

m.

Matt Chaput

Mar 23, 2011, 2:17:43 PM

to who...@googlegroups.com, howe...@googlemail.com

On 23/03/2011 1:18 PM, Steve Howe wrote:
> That would solve the case for emails, but the schema is really more
> complicated (I had simplified it for writing the email): I have to
> store multiple dates for each record, and even multiple comments
> (which would contain spaces), so KEYWORD would not really be useful.
> I'd lose all ID functionality: range search (and even sorting) on
> dates, integers, etc.

Ah, I see. Some thoughts:

* In terms of comments, you can fairly easily index multiple things
containing spaces by using a different separator (such as a NULL), but
I'm not sure why you'd want to index an entire comment as a single term.
Just using a TEXT field and doing u" ".join(comments) seems like it
would be more useful.

* I hadn't thought of the use case of indexing multiple dates per
document before but it's definitely missing functionality. I'll try to
fix this in a future release.

For now, you could create your own DATETIME subclass that accepts a
sequence in addition to a datetime object (untested code):

from whoosh.fields import DATETIME

class MDATETIME(DATETIME):
    def index(self, obj, **kwargs):
        if isinstance(obj, (list, tuple)):
            # If the user passed a list/tuple, call super.index() on
            # each date in the list and concatenate the return values
            values = []
            for dt in obj:
                values.extend(super(MDATETIME, self).index(dt, **kwargs))
            return values
        else:
            return super(MDATETIME, self).index(obj, **kwargs)

* What I don't like about supporting an "add more to this field" call is
that currently Writer.add_document() is a one shot action which keeps
the code simple. Adding a second way to add a document (e.g.
writer.start_document(), writer.add_field_data(),
writer.finish_document() ) would complicate the code and documentation.

I wonder if supporting passing a list in place of a string in the
keyword arguments to add_document() would be sufficient to cover most of
the use cases for this.
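
The separator idea from the first point above can be sketched in plain Python, assuming a NUL character never occurs inside a value. The names SEP, TERM_PATTERN, pack and unpack are hypothetical; in Whoosh, the same pattern could presumably be handed to a regex-based tokenizer on the field, but that wiring is an assumption and is not shown here.

```python
import re

SEP = "\x00"  # separator assumed never to appear inside a value
TERM_PATTERN = re.compile(r"[^\x00]+")  # one term = a run of non-NUL characters


def pack(values):
    """Join multi-word values into a single string, one term per value."""
    for value in values:
        if SEP in value:
            raise ValueError("value contains the separator: %r" % (value,))
    return SEP.join(values)


def unpack(packed):
    """Split a packed string back into its terms, as a tokenizer would."""
    return TERM_PATTERN.findall(packed)


# Made-up comments containing spaces.
comments = ["great product", "needs more docs"]
packed = pack(comments)
terms = unpack(packed)
```

Each comment survives as one term, spaces included, which is also why a TEXT field over the joined comments is usually the more useful choice for searching.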

> Nested schemas (i.e. a Schema as a field of another Schema) would also
> be very nice, but I guess that's out of scope as well...

Hierarchical indexing is definitely out of scope for the foreseeable
future. The closest you can get right now is indexing a "sub-document"
with different fields and a "foreign key" relationship with the "parent
document". It's not as fast or easy as the ideal, but it does the job. I
use this technique in the Houdini documentation quite a bit.

For example:

# id is stored so that it can be read back from a hit below
schema = Schema(id=ID(stored=True), name=TEXT, parent=KEYWORD)

# Documentation for a class
writer.add_document(id=u"whoosh.writing.Writer", name=u"Writer")

# Documentation for methods
writer.add_document(id=u"whoosh.writing.Writer#add_document",
                    name=u"add_document",
                    parent=u"whoosh.writing.Writer")
writer.add_document(id=u"whoosh.writing.Writer#commit",
                    name=u"commit", parent=u"whoosh.writing.Writer")

# Find all the methods for the top hit in a search results object
tophit = results[0]
methods = searcher.documents(parent=tophit["id"])
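
Applied to the nested record from the original question, the same parent/child layout might look like the sketch below. The flatten helper, the child id format and the extra email field are assumptions made for illustration; each returned dict is meant to be passed to writer.add_document(**doc) against a schema that also declares an email field.

```python
def flatten(record):
    """Return one parent document plus one child document per email.

    Hypothetical helper: children point at the parent through the
    "parent" field, playing the role of a foreign key.
    """
    parent_id = record["id"]
    docs = [{"id": parent_id, "name": record["name"]}]
    for n, entry in enumerate(record["emails"]):
        docs.append({
            "id": "%s#email%d" % (parent_id, n),  # made-up child id scheme
            "email": entry["email"],
            "parent": parent_id,
        })
    return docs


# Made-up record with placeholder addresses.
record = {
    "id": "1",
    "name": "Steve Howe",
    "emails": [
        {"email": "steve@example.com"},
        {"email": "howe@example.org"},
    ],
}
docs = flatten(record)
```

A search on the email field then returns a child document, and its parent value leads back to the full record.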

Beyond that you might realize you want a database rather than a
full-text index ;)

Matt
