hate to be a pain: whoosh vs. nucular?


Aaron Watters

May 7, 2009, 5:13:41 PM
to Whoosh
Sorry, but I have to ask.

On what basis do you claim that whoosh is "much faster" than nucular?
Have you tested it on 4 terabytes of email data? How did it do?

I suspect that the two packages have different advantages and
disadvantages, but nucular is quite fast (if not perhaps precisely
comparable).

-- Aaron Watters
http://nucular.sourceforge.net
http://aaron.oirt.rutgers.edu/myapp/docs/W1500.whyIsWhiffCool

Matt Chaput

May 7, 2009, 5:36:43 PM
to who...@googlegroups.com
Aaron Watters wrote:
> Sorry, but I have to ask.
>
> On what basis do you claim that whoosh is "much faster" than nucular?
> Have you tested it on 4 terabytes of email data? How did it do?
>
Fully on the basis of complete ignorance of its existence. :) I've
never heard tell of it before or seen it in the PyPI RSS feed. The only
pure-Python search libraries I was aware of were the discontinued Lupy,
the pre-Xapian itools, and a couple of others I can't think of right
now, all of which were really quite slow (mostly due to creating tons of
short-lived objects all over the place).

Not only do I not have 4 TB of email data, but I would never even
consider indexing it with a pure-Python solution if I did. If Nucular
can really index something like that from scratch in anything less than
geological time periods, I'm astounded. I'm excited to check this out now!

By the way, I've been trying to find time to set up some kind of
benchmarking harness for various Python search solutions (e.g. PyLucene,
xappy, textindexng), especially since I'm working on performance
improvements. I don't suppose Nucular has something like that already?
Anyway, I'll include Nucular in whatever custom thing I come up with.

Cheers,

Matt


Aaron Watters

May 7, 2009, 5:54:53 PM
to Whoosh


On May 7, 5:36 pm, Matt Chaput <m...@whoosh.ca> wrote:
> Aaron Watters wrote:

> Not only do I not have 4 TB of email data, but I would never even
> consider indexing it with a pure-Python solution if I did. If Nucular
> can really index something like that from scratch in anything less than
> geological time periods, I'm astounded. I'm excited to check this out now!

You are right. I was off by three orders of magnitude. I was
referring to the Enron email corpus:

http://www.cs.cmu.edu/~enron/

(I hope it's still there.) I think it's 4 GB (not TB). The nucular
test script for it is here:

http://aaron.oirt.rutgers.edu/cgi-bin/nucularRepo.cgi/file/2e6db8636331/test/enronTest.py

It takes several hours to build the archive, but when it's done,
it finds things fast.

Also look at the dblp reference archive:

http://dblp.uni-trier.de/xml/
http://aaron.oirt.rutgers.edu/cgi-bin/nucularRepo.cgi/file/2e6db8636331/test/dblpTest.py

I think you'll find that the underlying goals and models for the two
approaches are different (I'm not a big fan of "approximate"
retrieval), so it might be hard to make an exact comparison.

All the best -- Aaron Watters
http://aaron.oirt.rutgers.edu/myapp/docs/W1500.whyIsWhiffCool



Robert Kern

May 9, 2009, 5:20:50 PM
to who...@googlegroups.com
On Thu, May 7, 2009 at 16:54, Aaron Watters <aaron....@gmail.com> wrote:
>
> On May 7, 5:36 pm, Matt Chaput <m...@whoosh.ca> wrote:
>> Aaron Watters wrote:
>
>> Not only do I not have 4 TB of email data, but I would never even
>> consider indexing it with a pure-Python solution if I did. If Nucular
>> can really index something like that from scratch in anything less than
>> geological time periods, I'm astounded. I'm excited to check this out now!
>
> You are right. I was off by 3 orders of magnitude
> I was referring to the enron email corpus
>
>   http://www.cs.cmu.edu/~enron/
>
> (I hope it's still there).  I think it's 4 Gig (not T)

1.7 G for ~517000 messages.

On my Mac Pro machine, Nucular (hg tip) took about 60 minutes to
index. Whoosh (svn trunk) took at least twice as long (I need to wait
until I get back to the office Monday to check, but the ETA was about
2.5 hours). Attached is the code I used to generate the Whoosh index.
Increasing the number of messages for a given commit appears to
improve index times, but when I tried doing all messages in a single
commit, I ran into "too many open files" errors.

It is difficult to compare search times because Whoosh pulls out the
actual text of the results lazily and Nucular doesn't. By forcing the
creation of a list with the actual result dictionaries, I find that
Whoosh is faster by a factor of 4-5.


In [31]: from nucular import Nucular

In [32]: n = Nucular.Nucular('../testdata/enron')

In [33]: q = n.Query()

In [34]: q.anyWord('something')

In [35]: %time nres = q.resultDictionaries()
CPU times: user 13.34 s, sys: 7.46 s, total: 20.80 s
Wall time: 20.80 s

In [36]: len(nres)
Out[36]: 21979

In [39]: ix = index.open_dir('../testdata/enron_whoosh/')

In [40]: fields = ['Attendees', 'Bcc', 'Cc',
'Content_Transfer_Encoding', 'Content_Type', 'Date', 'From',
'Message_ID', 'Mime_Version', 'Re', 'Subject', 'Time', 'To',
'X_FileName', 'X_Folder', 'X_From', 'X_Origin', 'X_To', 'X_bcc',
'X_cc', 'b', 'i']

In [41]: qs = ' OR '.join(['%s:something' % field for field in fields])

In [42]: print qs
Attendees:something OR Bcc:something OR Cc:something OR
Content_Transfer_Encoding:something OR Content_Type:something OR
Date:something OR From:something OR Message_ID:something OR
Mime_Version:something OR Re:something OR Subject:something OR
Time:something OR To:something OR X_FileName:something OR
X_Folder:something OR X_From:something OR X_Origin:something OR
X_To:something OR X_bcc:something OR X_cc:something OR b:something OR
i:something

# Lazy:
In [43]: %time wres = ix.find(qs)
CPU times: user 2.77 s, sys: 0.07 s, total: 2.84 s
Wall time: 2.85 s

In [44]: len(wres)
Out[44]: 21959

# Eager, but from a fresh interpreter to avoid any caching that
# might be taking place inside Whoosh:
In [6]: %time wres = list(ix.find(qs))
CPU times: user 3.97 s, sys: 0.34 s, total: 4.31 s
Wall time: 4.31 s


The difference in the counts is probably due to details of the
tokenization, but at least they are very close so I think we're doing
a nearly apples-to-apples comparison.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco

whoosh_enron.py

Matt Chaput

May 9, 2009, 5:48:51 PM
to who...@googlegroups.com
> 1.7 G for ~517000 messages.
>
> On my Mac Pro machine, Nucular (hg tip) took about 60 minutes to
> index. Whoosh (svn trunk) took at least twice as long (I need to wait
> until I get back to the office Monday to check, but the ETA was about
> 2.5 hours). Attached is the code I used to generate the Whoosh index.
> Increasing the number of messages for a given commit appears to
> improve index times, but when I tried doing all messages in a single
> commit, I ran into "too many open files" errors.

Whoosh's indexing can be made much faster by increasing the maximum
amount of memory allowed for indexing (the default of 4 MB is way too
low).

# Use a maximum of 512 MB at a time for indexing
writer = my_index.writer(postlimit = 512 * 1024 * 1024)

(I need to add this to the docs)

My guess would be that's also the source of the "too many open files"
issue, since dividing 1.7 GB of data into 4 MB chunks for a merge sort
is a LOT of chunks. Increasing the memory limit may fix it. But
besides increasing the limit, I should probably be smarter about that
somehow. Some kind of hierarchical merge-sort in cases where there are
tons of chunks? Or just print a warning or something?
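A sketch of what that hierarchical merge could look like (purely
illustrative -- `hierarchical_merge` is a made-up helper, not anything
in Whoosh; in-memory lists stand in for the sorted on-disk chunks):

```python
import heapq
import itertools

def hierarchical_merge(runs, max_open=8):
    """Merge many sorted sequences while only ever combining
    `max_open` of them at a time, so the number of simultaneously
    open inputs (file handles, in the real case) stays bounded."""
    runs = list(runs)
    while len(runs) > max_open:
        # Each round merges groups of `max_open` runs into fewer,
        # longer runs, so a later round (or the final merge) sees
        # at most `max_open` inputs at once.
        runs = [list(heapq.merge(*runs[i:i + max_open]))
                for i in range(0, len(runs), max_open)]
    return list(heapq.merge(*runs))

# Toy usage: 20 sorted runs, merged while touching at most 4 at a time.
runs = [list(range(i, 100, 7)) for i in range(20)]
merged = hierarchical_merge(runs, max_open=4)
assert merged == sorted(itertools.chain.from_iterable(runs))
```

The tradeoff is extra passes over the data in exchange for a hard cap
on open files, which is the usual external merge-sort compromise.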

Cheers,

Matt

Aaron Watters

May 9, 2009, 6:24:57 PM
to Whoosh


On May 9, 5:20 pm, Robert Kern <robert.k...@gmail.com> wrote:
> On Thu, May 7, 2009 at 16:54, Aaron Watters <aaron.watt...@gmail.com> wrote:
>
> > On May 7, 5:36 pm, Matt Chaput <m...@whoosh.ca> wrote:
> >> Aaron Watters wrote:
>
> >> Not only do I not have 4 TB of email data, but I would never even
> >> consider indexing it with a pure-Python solution if I did. If Nucular
> >> can really index something like that from scratch in anything less than
> >> geological time periods, I'm astounded. I'm excited to check this out now!
>
> > You are right. I was off by 3 orders of magnitude
> > I was referring to the enron email corpus
>
> >  http://www.cs.cmu.edu/~enron/
>
> > (I hope it's still there).  I think it's 4 Gig (not T)
>
> 1.7 G for ~517000 messages.
>
> On my Mac Pro machine, Nucular (hg tip) took about 60 minutes to
> index. Whoosh (svn trunk) took at least twice as long (I need to wait
> until I get back to the office Monday to check, but the ETA was about
> 2.5 hours). Attached is the code I used to generate the Whoosh index.
> Increasing the number of messages for a given commit appears to
> improve index times, but when I tried doing all messages in a single
> commit, I ran into "too many open files" errors.
>
> It is difficult to compare search times because Whoosh pulls out the
> actual text of the results lazily and Nucular doesn't. By forcing the
> creation of a list with the actual result dictionaries, I get that
> Whoosh is much faster by a factor of 4-5.

1) Please try some queries which return small result sets, possibly by
ANDing two common terms -- I think you will find that nucular is much
faster for these. You are right that it will be slower for queries
with tens of thousands of results, since it does compute the entire
result before giving the first answer. [Note: I should add a "rough
result" option which doesn't do this, but might return a slightly too
large result.]

2) Please try some queries using a "warm" index -- the performance
improves quite a bit when the indices have been paged into the file
system cache.

Thanks! -- Aaron Watters
http://aaron.oirt.rutgers.edu/myapp/docs/W1500.whyIsWhiffCool

===
ymmv rsn afaik imho fwiw (rotfl)


Robert Kern

May 9, 2009, 6:26:45 PM
to who...@googlegroups.com
On Sat, May 9, 2009 at 16:48, Matt Chaput <ma...@whoosh.ca> wrote:
>
>> 1.7 G for ~517000 messages.
>>
>> On my Mac Pro machine, Nucular (hg tip) took about 60 minutes to
>> index. Whoosh (svn trunk) took at least twice as long (I need to wait
>> until I get back to the office Monday to check, but the ETA was about
>> 2.5 hours). Attached is the code I used to generate the Whoosh index.
>> Increasing the number of messages for a given commit appears to
>> improve index times, but when I tried doing all messages in a single
>> commit, I ran into "too many open files" errors.
>
> Whoosh's indexing can be made much faster by increasing the maximum
> amount of memory allowed for indexing (the default of 4 MB is way
> too low).
>
> # Use a maximum of 512 MB at a time for indexing
> writer = my_index.writer(postlimit = 512 * 1024 * 1024)
>
> (I need to add this to the docs)

It seems to be doing better with this postlimit (I'm only 100k
messages in), but the ETA is still almost 2 hours.

clearing directory using rm -rf '../testdata/enron_whoosh'
Indexed 50000 messages in 426.4344 s
Committed in 282.9436 s
Done 50000 messages in 709.6117 s so far (ETA: 7337.3849 s).
Indexed 50000 messages in 356.8732 s
Committed in 221.8561 s
Done 100000 messages in 1288.3413 s so far (ETA: 6660.7244 s).
...

> My guess would be that's also the source of the "too many files"
> issue, since dividing 1.7 GB of data into 4 MB chunks for a merge sort
> is a LOT of chunks. Increasing the memory limit may fix it. But
> besides increasing the limit, I should probably be smarter about that
> somehow. Some kind of hierarchical merge-sort in cases where there's
> tons of chunks? Or just print a warning or something?

You should be able to check and set the current limit for the process
using the resource module:

In [2]: resource.getrlimit(resource.RLIMIT_NOFILE)
Out[2]: (256L, 9223372036854775807L)

That's (soft limit, hard limit). You can change the soft limit using
resource.setrlimit(), too.

http://docs.python.org/library/resource

That should help it fail or degrade gracefully.
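For reference, bumping the soft limit looks something like this
(Unix-only; the 4096 target is just an arbitrary illustration, not a
recommended value):

```python
import resource

# (soft, hard) per-process limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# An unprivileged process may raise its soft limit up to the hard
# limit, but not beyond it.
if hard == resource.RLIM_INFINITY:
    new_soft = 4096
else:
    new_soft = min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

assert resource.getrlimit(resource.RLIMIT_NOFILE)[0] == new_soft
```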

Robert Kern

May 9, 2009, 6:39:46 PM
to who...@googlegroups.com

It would probably be worthwhile to implement a "lazy result" option
like Whoosh instead of a "rough result". Not only could we do better
apples-to-apples comparisons, it's usually a good feature for
human-facing search engines. Of course, that may not be part of
Nucular's aims.

> 2) Please try some queries using a "warm" index -- the performance
> improves
> quite a bit when the indices have been paged into the file system
> cache.

The timings are consistent when repeated (for at least that huge query
which may be killing the cache). I'll try a smaller query when I'm
finished with the Whoosh reindexing.

Matt Chaput

May 9, 2009, 6:45:58 PM
to who...@googlegroups.com
> It seems to be doing better with this postlimit (I'm only 100k
> messages in), but the ETA is still almost 2 hours.

Boo-urns :( . I'll have to see if I can improve that.

> clearing directory using rm -rf '../testdata/enron_whoosh'
> Indexed 50000 messages in 426.4344 s
> Committed in 282.9436 s
> Done 50000 messages in 709.6117 s so far (ETA: 7337.3849 s).
> Indexed 50000 messages in 356.8732 s
> Committed in 221.8561 s
> Done 100000 messages in 1288.3413 s so far (ETA: 6660.7244 s).
> ...

So you're not indexing it all in one go?

Wow, 4 minute commits... ugh.

> You should be able to check and set the current limit for the process
> using the resource module:

Thanks Robert, I didn't know about that. And thanks for running the
tests.

The search times will definitely improve if I can ever find the time
to finish the changes to how Whoosh deals with postings (besides using
faster read/write methods, I imagine chunked, seekable posting lists
should be very helpful in a gigabyte collection). But maybe I should
give up on the idea of Whoosh being useful for gigabyte collections
before I get depressed ;)

Matt

Robert Kern

May 9, 2009, 6:55:43 PM
to who...@googlegroups.com
On Sat, May 9, 2009 at 17:45, Matt Chaput <ma...@whoosh.ca> wrote:
>
>> It seems to be doing better with this postlimit (I'm only 100k
>> messages in), but the ETA is still almost 2 hours.
>
> Boo-urns :( . I'll have to see if I can improve that.
>
>> clearing directory using rm -rf '../testdata/enron_whoosh'
>> Indexed 50000 messages in 426.4344 s
>> Committed in 282.9436 s
>> Done 50000 messages in 709.6117 s so far (ETA: 7337.3849 s).
>> Indexed 50000 messages in 356.8732 s
>> Committed in 221.8561 s
>> Done 100000 messages in 1288.3413 s so far (ETA: 6660.7244 s).
>> ...
>
> So you're not indexing it all in one go?

My science training tells me to change just one thing at a time. :-)

> Wow, 4 minute commits... ugh.
>
>> You should be able to check and set the current limit for the process
>> using the resource module:
>
> Thanks Robert, I didn't know about that. And thanks for running the
> tests.
>
> The search times will definitely improve if I can ever find the time
> to finish the changes to how Whoosh deals with postings (besides using
> faster read/write methods, I imagine chunked, seekable posting lists
> should be very helpful in a gigabyte collection). But maybe I should
> give up on the idea of Whoosh being useful for gigabyte collections
> before I get depressed ;)

What I noticed when I've profiled indexing and querying before (not
with this dataset) is that the major hotspot appears to be pickling. I
think it would be profitable to look at a format that serializes the
Python objects faster. Marshalling is pretty dang snappy. You might
need to degrade to pickling for stored fields that might be arbitrary
instances. Or you could just drop that feature. :-)
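A minimal sketch of that marshal-with-pickle-fallback idea (the helper
names are made up for illustration; this is not Whoosh's actual
on-disk format):

```python
import marshal
import pickle

def dumps_fast(obj):
    """Serialize with marshal when the value is made of core types
    (marshal is typically much faster), falling back to pickle for
    arbitrary instances. A one-byte tag records the codec used."""
    try:
        return b"M" + marshal.dumps(obj)
    except ValueError:
        # marshal refuses types it doesn't know about.
        return b"P" + pickle.dumps(obj)

def loads_fast(data):
    codec, payload = data[:1], data[1:]
    return marshal.loads(payload) if codec == b"M" else pickle.loads(payload)

# Core types round-trip through marshal; a datetime.date (which
# marshal can't handle) transparently falls back to pickle.
import datetime
posting = ("something", [3, 17, 41], {"boost": 1.5})
assert loads_fast(dumps_fast(posting)) == posting
assert dumps_fast(posting)[:1] == b"M"
assert dumps_fast(datetime.date(2009, 5, 9))[:1] == b"P"
```

One caveat with marshal is that its format is tied to the Python
version, so it suits caches and rebuildable indexes better than
long-lived archival data.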

Aaron Watters

May 9, 2009, 7:10:00 PM
to Whoosh


On May 9, 6:24 pm, Aaron Watters <aaron.watt...@gmail.com> wrote:
>
> 1) Please try some queries which return small result sets, possibly by
> anding
> two common terms -- I think you will find that nucular is much faster
> for these.
> You are right that it will be slower for queries with 10's of
> thousands of results
> since it does compute the entire result before giving the first
> answer.
> [note: I should add a "rough result" option which doesn't do this, but
> might
> return a slightly too large result]

I lied. You can get the first result before getting all of them if
you use a more complex API protocol:

(result, status) = myQuery.evaluate()
identityList = result.identities()
count = len(identityList)
theIdentity = identityList[0]
theEntity = result.describe(theIdentity)

I'll eat my hat if the "something" query doesn't evaluate
in less than a couple seconds to pull the first entity
using this method.

But I have developed rather a taste for hats.

Robert Kern

May 10, 2009, 3:46:31 AM
to who...@googlegroups.com

Your tastebuds are quite safe! Nucular is the clear winner by a huge margin.

In [19]: from queries import whoosh, nuke

In [20]: %time whoosh('something')
CPU times: user 1.53 s, sys: 0.07 s, total: 1.60 s
Wall time: 1.61 s
Out[21]: 21959

In [22]: %time nuke('something')
CPU times: user 0.08 s, sys: 0.02 s, total: 0.10 s
Wall time: 0.48 s
Out[23]: 21979

In [24]: %time nuke('something')
CPU times: user 0.07 s, sys: 0.01 s, total: 0.07 s
Wall time: 0.07 s
Out[25]: 21979

In [26]: %time nuke('something')
CPU times: user 0.07 s, sys: 0.01 s, total: 0.07 s
Wall time: 0.07 s
Out[27]: 21979

In [28]: %time nuke('something')
CPU times: user 0.06 s, sys: 0.01 s, total: 0.07 s
Wall time: 0.07 s
Out[29]: 21979

In [30]: %time nuke('something', 'else')
CPU times: user 0.11 s, sys: 0.01 s, total: 0.12 s
Wall time: 0.12 s
Out[31]: 3837

In [32]: %time nuke('something', 'else')
CPU times: user 0.11 s, sys: 0.01 s, total: 0.12 s
Wall time: 0.12 s
Out[33]: 3837

In [34]: %time nuke('something', 'else')
CPU times: user 0.11 s, sys: 0.01 s, total: 0.12 s
Wall time: 0.12 s
Out[35]: 3837

In [36]: %time whoosh('something', 'else')
CPU times: user 2.15 s, sys: 0.11 s, total: 2.25 s
Wall time: 2.26 s
Out[37]: 3412

In [38]: %time whoosh('something', 'else')
CPU times: user 2.11 s, sys: 0.10 s, total: 2.22 s
Wall time: 2.22 s
Out[39]: 3412

queries.py

Matt Chaput

May 10, 2009, 9:31:55 AM
to who...@googlegroups.com
>> I lied. You can get the first result before getting all of them if
>> you use a more complex API protocol:
>>
>> (result, status) = myQuery.evaluate()
>> identityList = result.identities()
>> count = len(identityList)
>> theIdentity = identityList[0]
>> theEntity = result.describe(theIdentity)
>>
>> I'll eat my hat if the "something" query doesn't evaluate
>> in less than a couple seconds to pull the first entity
>> using this method.
>>
>> But I have developed rather a taste for hats.
>
> Your tastebuds are quite safe! Nucular is the clear winner by a huge
> margin.

Well, I hate to keep up the back and forth, but is that really just to
get the first (unscored!) result? I don't think that's a fair
comparison of what the two libraries are doing in this case, since a
Whoosh Results object always scores (using BM25F by default) the top N
results (where N is the "limit" keyword arg to Searcher.search(),
default=5000*). So Whoosh is running the scoring algorithm and looking
up field lengths for up to 5000 results, which I'm guessing is the
difference.

To get just one result in Whoosh without scoring, you would do
something like this:

searcher = my_index.searcher()
query = And([Term("content", u"something"), Term("content", u"else")])
docnums = list(query.docs(searcher))
print len(docnums)
print searcher.stored_fields(docnums[0])

I'd guess the speed of this would be comparable to the Nucular results.

* I assumed the naive user would expect the default to be to find and
return "everything", so I set the default really high, and let the
user lower it for efficiency. Maybe I should have set the default to 10.

Anyway, thanks for putting so much time into this, Robert. This is
really poking me to finish writing the Whoosh docs, which can only be
a good thing.

Also, while I have you on the line, Aaron: how do you do an
"OR" (disjunction) query in Nucular? How do you compose compound (I
mean hierarchical, as in nested brackets) queries? I can't tell from
the docs.

Thanks,

Matt

Aaron Watters

May 10, 2009, 11:17:34 AM
to Whoosh

> Also, while I have you on the line Aaron, how do you do an  
> "OR" (disjoint) query in Nukular? How do you compose compound (I mean  
> hierarchical, as in nested brackets) queries? I can't tell from the  
> docs.

Right now Nucular won't do this for you -- you have to write two
queries and union the results. I think I'll implement this, now
that you mention it :).
===
never eat anything bigger than your head. -- kliban

Aaron Watters

May 10, 2009, 12:10:27 PM
to Whoosh

> Well, I hate to keep up the back and forth, but is that really just to  
> get the first (unscored!) result? I don't think that's a fair  
> comparison of what the two libraries are doing in this case, since a  
> Whoosh Results object always scores (using BM25F by default) the top N  
> results (where N is the "limit" keyword arg to Searcher.search(),  
> default=5000*). So Whoosh is running the scoring algorithm and looking  
> up field lengths for up to 5000 results, which I'm guessing is the  
> difference.

It's conceivable.

I propose you modify your promotional material
to say "whoosh is much faster than any other
pure python full text indexing system <em>which
uses page ranking heuristics</em>."

This is why I don't like "approximate queries"
-- they make the queries slower and they usually
don't actually do what the user wants. [In my
case they usually point me to someone who
wants to sell me something whereas I was looking
for information.] Your inability to find nucular
is a case in point.

-- Aaron Watters
===
less is more

Matt Chaput

May 10, 2009, 3:07:01 PM
to who...@googlegroups.com
> I propose you modify your promotional material
> to say "whoosh is much faster than any other
> pure python full text indexing system <em>
> which uses page ranking heuristics</em>.

Fair enough I suppose.

> This is why I don't like "approximate queries"
> -- they make the queries slower and they usually
> don't actually do what the user wants. [In my
> case they usually point me to someone who
> wants to sell me something whereas I was looking
> for information.] Your inability to find nucular
> is a case in point.

There's nothing approximate about scored queries; they give you the
exact same results as a plain boolean query, just arranged in a
certain order. I doubt the users of my online help would appreciate
having to sift through 1000 randomly ordered results for the query
"render" when the scoring algorithm can put "how to render" in the
first spot. But to each his own.

Matt

Robert Kern

May 10, 2009, 5:07:15 PM
to who...@googlegroups.com
On Sun, May 10, 2009 at 08:31, Matt Chaput <ma...@whoosh.ca> wrote:
> To get just one result in Whoosh without scoring, you would do
> something like this:
>
> searcher = my_index.searcher()
> query = And([Term("content", u"something"), Term("content", u"else")])
> docnums = list(query.docs(searcher))
> print len(docnums)
> print searcher.stored_fields(docnums[0])
>
> I'd guess the speed of this would be comparable to the Nukular results.

Faster, but still pretty far from Nucular:

In [1]: import queries

In [2]: %time queries.whoosh('something', 'else')
CPU times: user 1.13 s, sys: 0.06 s, total: 1.19 s
Wall time: 1.20 s
Out[3]: 3412

In [4]: %time queries.whoosh('something', 'else')
CPU times: user 1.13 s, sys: 0.03 s, total: 1.16 s
Wall time: 1.16 s
Out[5]: 3412

In [6]: %time queries.whoosh('something', 'else')
CPU times: user 1.18 s, sys: 0.04 s, total: 1.22 s
Wall time: 1.22 s
Out[7]: 3412

queries.py

Robert Kern

May 10, 2009, 5:09:29 PM
to who...@googlegroups.com
On Sun, May 10, 2009 at 11:10, Aaron Watters <aaron....@gmail.com> wrote:
> This is why I don't like "approximate  queries"
> -- they make the queries slower and they usually
> don't actually do what the user wants.  [In my
> case they usually point me to someone who
> wants to sell me something whereas I was looking
> for information.]  Your inability to find nucular
> is a case in point.

The latter probably has more to do with its not being registered with
the canonical index for such things than with any particular search
technology:

http://pypi.python.org/pypi

Matt Chaput

May 10, 2009, 5:32:35 PM
to who...@googlegroups.com
> Faster, but still pretty far from Nucular:

Huh, that's really surprising. I have to look at all this in depth.

Thanks,

Matt

Matt Chaput

May 10, 2009, 5:42:20 PM
to who...@googlegroups.com
> Faster, but still pretty far from Nucular:

Oops, forgot to ask: what were the sizes of the indexes built by the
two libraries?

Thanks,

Matt

Robert Kern

May 10, 2009, 5:59:35 PM
to who...@googlegroups.com

[testdata]$ du -hsc enron enron_whoosh
6.3G enron
1.2G enron_whoosh
7.5G total

Aaron Watters

May 10, 2009, 10:03:45 PM
to Whoosh

> [testdata]$ du -hsc enron enron_whoosh
> 6.3G    enron
> 1.2G    enron_whoosh
> 7.5G    total

Nucular looks pretty wasteful by comparison, so there's
a valid tradeoff. There are built-in ways to shrink things
down in nucular:

1) by default nucular stores the source data as well as
the index -- this can be turned off, but it wouldn't
be enough to get down to 1.2G.
2) you can turn off fielded search and only allow unfielded
full-text searching, which might help a lot more, at the
expense of losing functionality.
3) I can think of some tricks that aren't implemented yet,
but that's cheating :).

btw: does whoosh support prefix, match, and range queries
for fields?

-- Aaron Watters

===
I haven't heard anything like that
since the orphanage burned down.
-- Twain, on opera

Robert Kern

May 10, 2009, 10:19:22 PM
to who...@googlegroups.com
On Sun, May 10, 2009 at 21:03, Aaron Watters <aaron....@gmail.com> wrote:
>
>
>> [testdata]$ du -hsc enron enron_whoosh
>> 6.3G    enron
>> 1.2G    enron_whoosh
>> 7.5G    total
>
> Nucular looks pretty wasteful by comparison, so there's
> a valid tradeoff.  There are built in ways to shrink things
> down in nucular:
>
> 1) by default nucular stores the source data as well as
>    the index -- this can be turned off, but this wouldn't
>    be enough to get down to 1.2G.

I made sure that the Whoosh example is storing the source data, too.

Aaron Watters

May 15, 2009, 1:16:18 PM
to Whoosh


On May 10, 3:07 pm, Matt Chaput <m...@whoosh.ca> wrote:
> > I propose you modify your promotional material
> > to say "whoosh is much faster than any other
> > pure python full text indexing system <em>
> > which uses page ranking heuristics</em>.
>
> Fair enough I suppose.

A quick word to the wise: you may be able to goose
the performance noticeably by turning off Python garbage
collection temporarily during query evaluation, since
it's likely that you are just allocating and never releasing
until you are done, so the gc is just wasting cycles (and
it can waste them pretty badly in some cases). I saw
about a 30% improvement on big queries when I added
this.
http://nucular.sourceforge.net

ps: boolean queries ETA next week.
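That gc suggestion can be wrapped up as a small context manager
(a sketch; whether it actually helps depends on the allocation
pattern, so measure before adopting it):

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for the duration of a
    block, restoring its previous state afterwards. Useful around
    work that allocates steadily and frees little until it finishes,
    where collection passes are pure overhead."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Hypothetical usage around a query:
# with gc_paused():
#     results = my_query.resultDictionaries()
```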

===
an apple every 8 hours
will keep 3 doctors away. --kliban

Matt Chaput

May 15, 2009, 1:50:07 PM
to who...@googlegroups.com
Aaron Watters wrote:
> A quick word to the wise: you may be able to goose
> the performance noticably by turning off Python garbage
> collection temporarily during query evaluation, since
> it's likely that you are just allocating and never releasing
> until you are done, so the gc is just wasting cycles (and
> it can waste them pretty badly in some cases). I saw
> about a 30% improvement on big queries when I added
> this.
>
That's a good idea, thanks. I've been evaluating Whoosh on the Enron
corpus and I'm starting to get better performance with some simple
changes. Part of it is education (a.k.a. FINISH THE DOCS ALREADY!),
i.e. a docs page on tuning for very large collections. For example,
Robert's test code uses TEXT field types for everything, which a naive
user would probably do, but it's wasteful since it generates and stores
term frequency/positions for fields that don't need it (like the
document number ;).

Matt

Robert Kern

May 15, 2009, 3:37:45 PM
to who...@googlegroups.com

Actually, I wasn't *that* naive. :-)

I deliberately used TEXT fields with storage in order to try to make
an apples-to-apples comparison with Nucular's example, which applies
the same analysis to all of the fields. The next step would be to make
an optimized-to-optimized comparison. Both are important, but
apples-to-apples needs to be done first in order to inform the later
comparisons.

Aaron Watters

May 19, 2009, 7:52:57 AM
to Whoosh
On May 10, 11:17 am, Aaron Watters <aaron.watt...@gmail.com> wrote:
> > Also, while I have you on the line Aaron, how do you do an  
> > "OR" (disjoint) query in Nukular? How do you compose compound (I mean  
> > hierarchical, as in nested brackets) queries? I can't tell from the  
> > docs.
>
> Right now Nucular won't do this for you...

But as of yesterday, it does:

# get list of dictionaries mentioning dogs or cats but not snails
L = session.dictionaries("(dogs | cats) ~snails")

http://nucular.sourceforge.net/Boolean.html

Thanks for the suggestion. I've been having a lot of fun with
it, for example finding offensive sexist jokes in the Enron
archive. What a buncha jerks!
-- Aaron Watters

===
sample from the Enron email archive:
Q: What's wrong if your sitting on the couch
and your wife comes out of the kitchen
complaining at you?
A: Her chain is too long.
(awful! just awful!)

Chris Clark

May 23, 2009, 2:38:58 PM
to Whoosh


On May 9, 3:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
> ....What I noticed when I've profiled indexing and querying before (not
> with this dataset) is that the major hotspot appears to be pickling. I
> think it would be profitable to look at a format that serializes the
> Python objects faster. Marshalling is pretty dang snappy. You might
> need to degrade to pickling for stored fields that might be arbitrary
> instances. Or you could just drop that feature. :-)

There is an interesting benchmark at http://inkdroid.org/journal/2008/10/24/json-vs-pickle/
that shows one of the JSON serializers being faster than (c)Pickle.
Not sure if that is useful here, but it is interesting.
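For the curious, a throwaway comparison on a small record is easy to
run (the relative numbers vary a lot by Python version, serializer
implementation, and data shape, so treat any single result with
suspicion; the record here is just an invented email-like dict):

```python
import json
import pickle
import timeit

# A record shaped roughly like a stored email document.
doc = {"Subject": "something", "To": ["a@example.com"] * 10,
       "Date": "2009-05-09", "length": 12345}

n = 10000
t_pickle = timeit.timeit(
    lambda: pickle.dumps(doc, protocol=pickle.HIGHEST_PROTOCOL), number=n)
t_json = timeit.timeit(lambda: json.dumps(doc), number=n)
print("pickle: %.3fs  json: %.3fs for %d dumps" % (t_pickle, t_json, n))

# Whichever is faster, the round trip must be lossless for this data.
assert json.loads(json.dumps(doc)) == doc
assert pickle.loads(pickle.dumps(doc)) == doc
```

Note that JSON only covers JSON-representable types, so like marshal
it would need a fallback for arbitrary stored objects.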