Two newbie questions

Weef Bellington

Dec 10, 2009, 2:55:19 PM

to Whoosh

Hi, I'm starting to use Whoosh and I have two basic questions:

1) Sometimes when I enter a Unicode string containing a forward slash
as input to a TEXT field, only a part of the string is stored in the
index. Why is this? Does the slash need to be escaped somehow?

2) I've tried using the DATETIME field as described on the website
here: http://files.whoosh.ca/whoosh/docs/latest/changes.html but I'm
getting an error message. I've tried looking at the source but I'm not
yet familiar enough with it to understand what's going on. Here's the
error:

File "build\bdist.win32\egg\whoosh\fields.py", line 168, in __init__
TypeError: __init__() takes at least 2 arguments (1 given)

Any ideas? All help is greatly appreciated! :)

Matt Chaput

Dec 10, 2009, 3:33:16 PM

to who...@googlegroups.com

Weef Bellington wrote:
> 1) Sometimes when I enter a Unicode string containing a forward slash
> as input to a TEXT field, only a part of the string is stored in the
> index. Why is this? Does the slash need to be escaped somehow?

Not sure what you mean by "part of the string", but in the default
analyzer a slash is treated as a word separator, so indexing
u"alfa/bravo" is the same as indexing u"alfa bravo": it will be
indexed as two separate "words".

The first thing to try would be making your own analyzer with a custom
RegexTokenizer, using a custom term regex (or, write a regex to match
the "whitespace" between terms and set gaps=True).

from whoosh.analysis import (RegexTokenizer, LowercaseFilter,
                             StopFilter, StemFilter)

# Create a tokenizer using a custom regex: a term is one or more runs
# of word characters, optionally joined by a single slash
mytokenizer = RegexTokenizer(r"\w+(/?\w+)*")

# Add the filters you want
myanalyzer = (mytokenizer | LowercaseFilter()
              | StopFilter() | StemFilter())

# You can test the analyzer like this...
print(list(token.text
           for token
           in myanalyzer(u"How to index this/that and the other")))
# [u'how', u'index', u'this/that', u'other']

# Use your analyzer in a field specification
from whoosh.fields import Schema, TEXT
schema = Schema(content=TEXT(analyzer=myanalyzer))
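To see why the slash splits terms, here is a rough sketch that uses only Python's standard re module rather than Whoosh itself. It assumes the default term pattern is something like \w+(\.?\w+)* (dots allowed inside a term, slashes not), and the helper names match_terms and split_on_gaps are made up for illustration. It compares that pattern with the slash-aware one above, and with the "match the gaps instead" approach that gaps=True refers to.

```python
import re

# Assumption: a term pattern similar to the default one, which lets a
# dot join word runs but treats a slash as a separator.
default_like = re.compile(r"\w+(\.?\w+)*")

# The custom term pattern from the message above: a slash may join runs.
slash_aware = re.compile(r"\w+(/?\w+)*")

# The gaps approach: the pattern matches the separators, not the terms.
gap_pattern = re.compile(r"\s+")


def match_terms(pattern, text):
    """Return each substring the pattern matches (terms are the matches)."""
    return [m.group(0) for m in pattern.finditer(text)]


def split_on_gaps(pattern, text):
    """Return the text between matches (terms are what is left over)."""
    return [piece for piece in pattern.split(text) if piece]


sample = "alfa/bravo charlie"
default_terms = match_terms(default_like, sample)
custom_terms = match_terms(slash_aware, sample)
gap_terms = split_on_gaps(gap_pattern, sample)

print(default_terms)  # ['alfa', 'bravo', 'charlie']
print(custom_terms)   # ['alfa/bravo', 'charlie']
print(gap_terms)      # ['alfa/bravo', 'charlie']
```

The two custom approaches agree on this input, but they differ in general: the term pattern only keeps a slash that sits between word characters, while splitting on whitespace keeps every non-space character, punctuation included.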

If you needed more complex tokenization than a regular expression, you'd
need to write your own tokenizer class. Check out the whoosh.analysis
module to see what's available in terms of text analysis.

> 2) I've tried using the DATETIME field as described on the website
> here: http://files.whoosh.ca/whoosh/docs/latest/changes.html but I'm
> getting an error message. I've tried looking at the source but I'm not
> yet familiar enough with it to understand what's going on. Here's the
> error:
>

Sorry, the DATETIME field type was a bit of experimentation that I
started and left in an unfinished state :( For now, you should avoid it
and index/store date/time fields yourself using another field type such
as ID, by manually converting your dates to lexically sortable
representations, e.g. 20091210.
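A minimal sketch of the "lexically sortable" conversion, using only the standard datetime module. The helper names sortable_date and sortable_datetime are hypothetical, and the time-of-day format is an assumption that extends the 20091210 example. The point is that zero-padded, most-significant-first strings sort in the same order as the dates they represent.

```python
from datetime import date, datetime


def sortable_date(d):
    """Format a date as YYYYMMDD, e.g. 20091210."""
    return d.strftime("%Y%m%d")


def sortable_datetime(dt):
    """Format a datetime as YYYYMMDDHHMMSS (assumed extension for times)."""
    return dt.strftime("%Y%m%d%H%M%S")


dates = [date(2009, 12, 10), date(2009, 2, 3), date(2010, 1, 1)]
keys = [sortable_date(d) for d in dates]

# Plain string sorting gives the same order as sorting the dates.
sorted_keys = sorted(keys)
print(sorted_keys)  # ['20090203', '20091210', '20100101']

stamp = sortable_datetime(datetime(2009, 12, 10, 15, 33, 16))
print(stamp)  # 20091210153316
```

The resulting string is what you would pass as the field value when adding a document. Because every key has the same width, comparing two keys as strings gives the same answer as comparing the dates.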

Hope that helps, let me know if anything isn't clear, or if I missed the
point somewhere. Example code showing the problems helps too :)

Matt

Muayyad AlSadi

Dec 11, 2009, 3:47:47 AM

to who...@googlegroups.com

> Sometimes when I enter a Unicode string containing a forward slash

You may need to look at
http://packages.python.org/Whoosh/querylang.html#escaping-special-characters

David Stemmer

Dec 14, 2009, 4:02:18 AM

to who...@googlegroups.com

Thanks very much both of you. Your comments covered everything I needed to know and then some.
