Not only do I not have 4 TB of email data, but I would never even
consider indexing it with a pure-Python solution if I did. If Nucular
can really index something like that from scratch in anything less than
geological time periods, I'm astounded. I'm excited to check this out now!
By the way, I've been trying to find time to set up some kind of
benchmarking harness for various Python search solutions (e.g. PyLucene,
xappy, textindexng), especially since I'm working on performance
improvements. I don't suppose Nucular has something like that already?
Anyway, I'll include Nucular in whatever custom thing I come up with.
Cheers,
Matt
1.7 GB for ~517000 messages.
On my Mac Pro, Nucular (hg tip) took about 60 minutes to
index. Whoosh (svn trunk) took at least twice as long (I need to wait
until I get back to the office Monday to check, but the ETA was about
2.5 hours). Attached is the code I used to generate the Whoosh index.
Increasing the number of messages per commit appears to improve
indexing times, but when I tried doing all of the messages in a single
commit, I ran into "too many open files" errors.
It is difficult to compare search times because Whoosh pulls out the
actual text of the results lazily and Nucular doesn't. Forcing the
creation of a list with the actual result dictionaries, I find that
Whoosh is faster by a factor of 4-5.
In [31]: from nucular import Nucular
In [32]: n = Nucular.Nucular('../testdata/enron')
In [33]: q = n.Query()
In [34]: q.anyWord('something')
In [35]: %time nres = q.resultDictionaries()
CPU times: user 13.34 s, sys: 7.46 s, total: 20.80 s
Wall time: 20.80 s
In [36]: len(nres)
Out[36]: 21979
In [39]: ix = index.open_dir('../testdata/enron_whoosh/')
In [40]: fields = ['Attendees', 'Bcc', 'Cc',
'Content_Transfer_Encoding', 'Content_Type', 'Date', 'From',
'Message_ID', 'Mime_Version', 'Re', 'Subject', 'Time', 'To',
'X_FileName', 'X_Folder', 'X_From', 'X_Origin', 'X_To', 'X_bcc',
'X_cc', 'b', 'i']
In [41]: qs = ' OR '.join(['%s:something' % field for field in fields])
In [42]: print qs
Attendees:something OR Bcc:something OR Cc:something OR
Content_Transfer_Encoding:something OR Content_Type:something OR
Date:something OR From:something OR Message_ID:something OR
Mime_Version:something OR Re:something OR Subject:something OR
Time:something OR To:something OR X_FileName:something OR
X_Folder:something OR X_From:something OR X_Origin:something OR
X_To:something OR X_bcc:something OR X_cc:something OR b:something OR
i:something
# Lazy:
In [43]: %time wres = ix.find(qs)
CPU times: user 2.77 s, sys: 0.07 s, total: 2.84 s
Wall time: 2.85 s
In [44]: len(wres)
Out[44]: 21959
# Eager, but from a fresh interpreter to avoid any caching that might
# be taking place inside Whoosh:
In [6]: %time wres = list(ix.find(qs))
CPU times: user 3.97 s, sys: 0.34 s, total: 4.31 s
Wall time: 4.31 s
The difference in the counts is probably due to details of the
tokenization, but at least they are very close, so I think we're doing
a nearly apples-to-apples comparison.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Whoosh's indexing can be made much faster by increasing the maximum
amount of memory allowed for indexing (the default of 4 MB is way too
low).
# Use a maximum of 512 MB at a time for indexing
writer = my_index.writer(postlimit = 512 * 1024 * 1024)
(I need to add this to the docs)
My guess would be that's also the source of the "too many open files"
issue, since dividing 1.7 GB of data into 4 MB chunks for a merge sort
is a LOT of chunks. Increasing the memory limit may fix it. But
besides increasing the limit, I should probably be smarter about that
somehow. Some kind of hierarchical merge-sort in cases where there are
tons of chunks? Or just print a warning or something?
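To sketch the hierarchical idea (a toy illustration with in-memory
lists standing in for the 4 MB spill files; nothing here is actual
Whoosh code): merging the sorted runs a few at a time bounds how many
have to be open simultaneously, at the cost of extra passes.

```python
import heapq

def merge_in_batches(runs, fanout=8):
    # Merge at most `fanout` sorted runs at a time, repeating until one
    # run remains; at no point are more than `fanout` runs open at once.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fanout]))
                for i in range(0, len(runs), fanout)]
    return runs[0]

# Toy "runs": sorted lists standing in for sorted posting files on disk.
runs = [sorted(range(i, 100, 7)) for i in range(7)]
merged = merge_in_batches(runs, fanout=3)
assert merged == sorted(range(100))
```

With real spill files, each round would stream `fanout` files into one
new file, so the open-file count stays constant no matter how many
chunks the 4 MB limit produced.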
Cheers,
Matt
It seems to be doing better with this postlimit (I'm only 100k
messages in), but the ETA is still almost 2 hours.
clearing directory using rm -rf '../testdata/enron_whoosh'
Indexed 50000 messages in 426.4344 s
Committed in 282.9436 s
Done 50000 messages in 709.6117 s so far (ETA: 7337.3849 s).
Indexed 50000 messages in 356.8732 s
Committed in 221.8561 s
Done 100000 messages in 1288.3413 s so far (ETA: 6660.7244 s).
...
> My guess would be that's also the source of the "too many open files"
> issue, since dividing 1.7 GB of data into 4 MB chunks for a merge sort
> is a LOT of chunks. Increasing the memory limit may fix it. But
> besides increasing the limit, I should probably be smarter about that
> somehow. Some kind of hierarchical merge-sort in cases where there are
> tons of chunks? Or just print a warning or something?
You should be able to check and set the current limit for the process
using the resource module:
In [2]: resource.getrlimit(resource.RLIMIT_NOFILE)
Out[2]: (256L, 9223372036854775807L)
That's (soft limit, hard limit). You can change the soft limit using
resource.setrlimit(), too.
http://docs.python.org/library/resource
That should help Whoosh fail or degrade gracefully.
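For what it's worth, a minimal sketch of bumping the soft limit (the
4096 target is an arbitrary illustration, not a Whoosh requirement):

```python
import resource

# Current per-process file-descriptor limits: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%r hard=%r" % (soft, hard))

# An unprivileged process may raise the soft limit up to the hard limit;
# only a privileged one can raise the hard limit itself.
target = 4096
if hard == resource.RLIM_INFINITY or target <= hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))

print("new soft limit: %r" % resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

Checking `getrlimit()` before a big merge and either raising the soft
limit or falling back to a smaller fanout would cover both cases.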
It would probably be worthwhile to implement a "lazy result" option
like Whoosh instead of a "rough result". Not only could we do better
apples-to-apples comparisons, it's usually a good feature for
human-facing search engines. Of course, that may not be part of
Nucular's aims.
> 2) Please try some queries using a "warm" index -- the performance
> improves
> quite a bit when the indices have been paged into the file system
> cache.
The timings are consistent when repeated (for at least that huge query
which may be killing the cache). I'll try a smaller query when I'm
finished with the Whoosh reindexing.
Boo-urns :( . I'll have to see if I can improve that.
> clearing directory using rm -rf '../testdata/enron_whoosh'
> Indexed 50000 messages in 426.4344 s
> Committed in 282.9436 s
> Done 50000 messages in 709.6117 s so far (ETA: 7337.3849 s).
> Indexed 50000 messages in 356.8732 s
> Committed in 221.8561 s
> Done 100000 messages in 1288.3413 s so far (ETA: 6660.7244 s).
> ...
So you're not indexing it all in one go?
Wow, 4-minute commits... ugh.
> You should be able to check and set the current limit for the process
> using the resource module:
Thanks Robert, I didn't know about that. And thanks for running the
tests.
The search times will definitely improve if I can ever find the time
to finish the changes to how Whoosh deals with postings (besides using
faster read/write methods, I imagine chunked, seekable posting lists
should be very helpful in a gigabyte collection). But maybe I should
give up on the idea of Whoosh being useful for gigabyte collections
before I get depressed ;)
Matt
My science training tells me to change just one thing at a time. :-)
> Wow, 4 minute commits... ugh.
>
>> You should be able to check and set the current limit for the process
>> using the resource module:
>
> Thanks Robert, I didn't know about that. And thanks for running the
> tests.
>
> The search times will definitely improve if I can ever find the time
> to finish the changes to how Whoosh deals with postings (besides using
> faster read/write methods, I imagine chunked, seekable posting lists
> should be very helpful in a gigabyte collection). But maybe I should
> give up on the idea of Whoosh being useful for gigabyte collections
> before I get depressed ;)
What I noticed when I've profiled indexing and querying before (not
with this dataset) is that the major hotspot appears to be pickling. I
think it would be profitable to look at a format that serializes the
Python objects faster. Marshalling is pretty dang snappy. You might
need to fall back to pickling for stored fields that can be arbitrary
instances. Or you could just drop that feature. :-)
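To put a rough number on that (a standalone micro-benchmark, not
Whoosh code; the postings shape is made up, and marshal only handles
builtin types, which is why the pickle fallback would still be needed):

```python
import marshal
import pickle
import timeit

# A postings-like structure: (docnum, frequency) pairs for one term.
postings = [(i, i % 7 + 1) for i in range(10000)]

pickle_s = timeit.timeit(lambda: pickle.dumps(postings, -1), number=200)
marshal_s = timeit.timeit(lambda: marshal.dumps(postings), number=200)
print("pickle: %.3f s  marshal: %.3f s" % (pickle_s, marshal_s))

# Round-tripping through marshal preserves simple builtin structures.
assert marshal.loads(marshal.dumps(postings)) == postings
```

One caveat: marshal's format is tied to the Python version, so it only
suits transient index files, not anything meant to be portable.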
Your tastebuds are quite safe! Nucular is the clear winner by a huge margin.
In [19]: from queries import whoosh, nuke
In [20]: %time whoosh('something')
CPU times: user 1.53 s, sys: 0.07 s, total: 1.60 s
Wall time: 1.61 s
Out[21]: 21959
In [22]: %time nuke('something')
CPU times: user 0.08 s, sys: 0.02 s, total: 0.10 s
Wall time: 0.48 s
Out[23]: 21979
In [24]: %time nuke('something')
CPU times: user 0.07 s, sys: 0.01 s, total: 0.07 s
Wall time: 0.07 s
Out[25]: 21979
In [26]: %time nuke('something')
CPU times: user 0.07 s, sys: 0.01 s, total: 0.07 s
Wall time: 0.07 s
Out[27]: 21979
In [28]: %time nuke('something')
CPU times: user 0.06 s, sys: 0.01 s, total: 0.07 s
Wall time: 0.07 s
Out[29]: 21979
In [30]: %time nuke('something', 'else')
CPU times: user 0.11 s, sys: 0.01 s, total: 0.12 s
Wall time: 0.12 s
Out[31]: 3837
In [32]: %time nuke('something', 'else')
CPU times: user 0.11 s, sys: 0.01 s, total: 0.12 s
Wall time: 0.12 s
Out[33]: 3837
In [34]: %time nuke('something', 'else')
CPU times: user 0.11 s, sys: 0.01 s, total: 0.12 s
Wall time: 0.12 s
Out[35]: 3837
In [36]: %time whoosh('something', 'else')
CPU times: user 2.15 s, sys: 0.11 s, total: 2.25 s
Wall time: 2.26 s
Out[37]: 3412
In [38]: %time whoosh('something', 'else')
CPU times: user 2.11 s, sys: 0.10 s, total: 2.22 s
Wall time: 2.22 s
Out[39]: 3412
Well, I hate to keep up the back and forth, but is that really just to
get the first (unscored!) result? I don't think that's a fair
comparison of what the two libraries are doing in this case, since a
Whoosh Results object always scores (using BM25F by default) the top N
results (where N is the "limit" keyword arg to Searcher.search(),
default=5000*). So Whoosh is running the scoring algorithm and looking
up field lengths for up to 5000 results, which I'm guessing is the
difference.
To get just one result in Whoosh without scoring, you would do
something like this:
searcher = my_index.searcher()
query = And([Term("content", u"something"), Term("content", u"else")])
docnums = list(query.docs(searcher))
print len(docnums)
print searcher.stored_fields(docnums[0])
I'd guess the speed of this would be comparable to the Nucular results.
* I assumed the naive user would expect the default to be to find and
return "everything", so I set the default really high, and let the
user lower it for efficiency. Maybe I should have set the default to 10.
Anyway, thanks for putting so much time into this, Robert. This is
really poking me to finish writing the Whoosh docs, which can only be
a good thing.
Also, while I have you on the line Aaron, how do you do an
"OR" (disjunctive) query in Nucular? How do you compose compound (I
mean hierarchical, as in nested brackets) queries? I can't tell from
the docs.
Thanks,
Matt
Fair enough I suppose.
> This is why I don't like "approximate queries"
> -- they make the queries slower and they usually
> don't actually do what the user wants. [In my
> case they usually point me to someone who
> wants to sell me something whereas I was looking
> for information.] Your inability to find nucular
> is a case in point.
There's nothing approximate about scored queries; they give you the
exact same results as a plain boolean query, just arranged in a
certain order. I doubt the users of my online help would appreciate
having to sift through 1000 randomly ordered results for the query
"render" when the scoring algorithm can put "how to render" in the
first spot. But to each his own.
Matt
Faster, but still pretty far from Nucular:
In [1]: import queries
In [2]: %time queries.whoosh('something', 'else')
CPU times: user 1.13 s, sys: 0.06 s, total: 1.19 s
Wall time: 1.20 s
Out[3]: 3412
In [4]: %time queries.whoosh('something', 'else')
CPU times: user 1.13 s, sys: 0.03 s, total: 1.16 s
Wall time: 1.16 s
Out[5]: 3412
In [6]: %time queries.whoosh('something', 'else')
CPU times: user 1.18 s, sys: 0.04 s, total: 1.22 s
Wall time: 1.22 s
Out[7]: 3412
The latter probably has more to do with its not being registered with
the canonical index for such things than with any particular search
technology:
Huh, that's really surprising. I have to look at all this in depth.
Thanks,
Matt
Oops, forgot to ask: what were the sizes of the indexes built by the
two libraries?
Thanks,
Matt
[testdata]$ du -hsc enron enron_whoosh
6.3G enron
1.2G enron_whoosh
7.5G total
I made sure that the Whoosh example is storing the source data, too.
Matt
Actually, I wasn't *that* naive. :-)
I deliberately used TEXT fields with storage in order to try to make
an apples-to-apples comparison with Nucular's example, which applies
the same analysis to all of the fields. The next step would be to make
an optimized-to-optimized comparison. Both are important, but
apples-to-apples needs to be done first in order to inform the later
comparisons.