I want to implement a fulltext search for messages in a forum. More
exactly for each message I store pairs (wordId, msgId) for each
identified word and when I search something I want to be able to
retrieve very quickly all msgId for a given wordId.
For starters I used PySqLite, using a table with index on wordId. It
works fine but it consumes way too much space on the disk. For example
for 19103 pairs it takes 1686528 bytes, which amounts to around 88
bytes per pair of integers, way too much in my opinion (especially
considering the range of those integers - one is in the range
1..100000 and the other in the range 1..500000).
So my problem now is what should I use to store my data and still be
able to retrieve all messages for a given word very quickly. Either I
find a workaround in sqlite to make it store values on less bytes, or
I use something else.
I looked at the (g)dbm family, but they store one entry per key, and
both key and value must be strings. Using them would mean storing the
list of messages for a word as a string, then each time I find a new
message containing that word, replacing this string with the old
string plus the new message id - and I think this is inefficient.
At the moment I am looking at BerkeleyDB - still checking if this will
suit my needs.
Any advice or hints? I'm looking only for embedded solutions like
PySqLite; if a database needs a server to be installed then it's not
an option.
Thanks in advance,
Andrei
Using any kind of database in a big message system may thrash your
disk pretty badly if you update the database every time a new message
is entered. That means making hundreds of new database entries, one
for each word in the message, each entry needing a disk seek. If
you're adding a few new messages a second, it will be hard for your
disk to keep up.
Anyway, you don't need a real database. You're not doing that kind of
access, you don't need multi-phase transactions, blah blah blah. I
think the way to do it instead is to keep several tables:
1) a static table on disk, simply (wordId, msgId) pairs of 4-byte
binary ints, sorted on wordId. You can also have an index for
it: a table with one entry for each wordId, saying where in
the big table the msgId's for that word start.
2) an update log on disk, (wordId, msgId) pairs like above, but
unsorted.
3) an in-memory table (Python dict) containing the same pairs as
the update log, but using a dict structure for instant lookup.
When a new message comes in, you just append all its (wordId,msgId)
pairs to the update log (just append, no random seeking needed), and
also update the in-memory table. Actually you don't need to store
(wordId,msgId) for every pair. Just append a record that says
"next nnn records are for msgId xxx" and then only store the wordId's.
When a search query comes in, look in the in-memory table by hashing,
and if the word isn't found, look in the static table by binary search.
Then once a night, update the static table by running the update log
through a sort utility and then merging it with the static table.
Then you can delete the update log and in-memory table and begin them
again. The sort-merge phase is relatively fast because sorting
utilities are written to sling around large buffers, minimizing disk
head motion.
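The nightly sort-merge can be sketched like this, assuming for
simplicity that both files hold plain packed (wordId, msgId) pairs of
4-byte unsigned ints (names and layout are mine); the log is sorted
in memory, which is fine because it only holds one day's updates:

```python
import heapq
import struct

REC = struct.Struct("<II")  # one packed (wordId, msgId) pair per record

def read_pairs(path):
    """Yield (wordId, msgId) tuples from a file of packed pairs."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(REC.size)
            if not chunk:
                break
            yield REC.unpack(chunk)

def merge_log_into_static(static_path, log_path, out_path):
    """Sort the day's update log and merge it with the already
    sorted static table into a new static table."""
    log_pairs = sorted(read_pairs(log_path))   # the small side
    with open(out_path, "wb") as out:
        for pair in heapq.merge(read_pairs(static_path), log_pairs):
            out.write(REC.pack(*pair))
```

heapq.merge streams both inputs, so the big static table is read and
written strictly sequentially, which is the whole point of the scheme.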
If the system crashes or you stop the server for some reason, on
restart you can simply read the update log into memory.
You probably don't even need the in-memory structure if you're not
taking too many search queries--the update log should be small enough
that you can just sequentially scan it when a query comes in.
The main idea is that you replace a lot of random small updates with
occasional big sorting operations. The big sorts are far more
i/o-efficient than the small updates. These are the methods they
always used in the old days, when even random-access disk space was
very expensive, so big data sets were always stored on completely
sequential media (magnetic tape).
A pure Python fulltext indexer - http://divmod.org/Lupy/index.html
Jp
Stormbringer> Hi, I want to implement a fulltext search for
Stormbringer> messages in a forum. More exactly for each message I
Stormbringer> store pairs (wordId, msgId) for each identified word
Stormbringer> and when I search something I want to be able to
Stormbringer> retrieve very quickly all msgId for a given wordId.
Stormbringer> For starters I used PySqLite, using a table with
Stormbringer> index on wordId. It works fine but it consumes way
Stormbringer> too much space on the disk. For example for 19103
Stormbringer> pairs it takes 1686528 bytes, which amounts to
Stormbringer> around 88 bytes per pair of integers, way too much
Stormbringer> in my opinion (especially considering the range of
Stormbringer> those integers - one is in the range 1..100000 and
Stormbringer> the other in the range 1..500000).
What about using a binary file of unsigned ints which you load into a
python dictionary and do everything in memory? There would be no
extra overhead in the file and it would be very fast, if you are able
to hold the 100,000 ints in memory.
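A sketch of that idea, assuming the binary file is just packed
machine-size unsigned (wordId, msgId) pairs (the layout and the
function name are mine):

```python
import os
from array import array

def load_index(path):
    """Read a flat binary file of (wordId, msgId) pairs of unsigned
    ints and build a wordId -> [msgId, ...] dict in memory."""
    ints = array("I")
    with open(path, "rb") as f:
        ints.fromfile(f, os.path.getsize(path) // ints.itemsize)
    index = {}
    it = iter(ints)
    for word_id, msg_id in zip(it, it):   # consume the flat list in pairs
        index.setdefault(word_id, []).append(msg_id)
    return index
```

Note that array("I")'s item size is platform-dependent, so the same
array type code should be used when writing the file.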
JDH
No, it's much worse than that. The 100,000 ints are index numbers
for individual words. The 500,000 ints are articles and there can
be thousands of words in each article. So you may need to store
billions of ints, not just 100,000 of them.
There's the rub. PySqLite consumes a lot of disk space presumably to
support efficient access to the data (hashing keys, etc). If you want
compact storage you aren't likely to do much better than
wordid1:msgid1
wordid2:msgid2
...
Unfortunately, searching that will probably be a bit slow.
If the word ids and msg ids have fixed maximum sizes you might try zero- or
space- padding them so your records are of fixed length. You can then store
the file sorted by wordid and do a binary search in the file, looking
for the records of interest.
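A sketch of that fixed-width approach, assuming zero-padded 6-digit
ids in 14-byte records like b"000042:000317\n" and a file kept sorted
by wordid; bisect can binary-search the file in place through a small
sequence wrapper (all names here are mine):

```python
import bisect
import os

REC = 14  # b"wwwwww:mmmmmm\n" -- zero-padded wordid, ':', msgid, newline

class WordKeys:
    """Expose the file's zero-padded wordid fields as a read-only
    sequence so bisect can binary-search the file without loading it."""
    def __init__(self, f):
        self.f = f
        self.n = os.fstat(f.fileno()).st_size // REC
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        self.f.seek(i * REC)
        return self.f.read(6)         # just the wordid field

def msg_ids_for(f, word_id):
    """Return all msgIds stored for word_id in the sorted record file."""
    keys = WordKeys(f)
    key = b"%06d" % word_id
    i = bisect.bisect_left(keys, key)
    out = []
    while i < len(keys) and keys[i] == key:
        f.seek(i * REC + 7)
        out.append(int(f.read(6)))    # the msgid field
        i += 1
    return out
```

Each lookup costs O(log n) seeks for the binary search plus one seek
per matching record, which is why the poster below still found it too
slow for search-heavy workloads.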
Another possibility, assuming your database keys are integers, is to try the
bsddb recno format database. I've never used it before (not as much call
for a database file with integer keys), but it seems to be *relatively*
space-efficient. I believe it doesn't actually need to store the keys,
since they are implicit.
I built a recno db file containing the pickle of the squares of the numbers
in range(1, 100000):
>>> import pickle
>>> from bsddb import rnopen
>>> db1 = rnopen("squares.db")
>>> for n in range(1,100000):
... db1[n] = pickle.dumps(n*n)
...
The total string length of the pickled squares was 1.3MB. The file size was
2.1MB.
Skip
Thanks ! This is exactly what I needed, and the size of the indexes is
around 30%, much much less than what I could have achieved with my
code. Not to mention the fact that I get phrase search and some other
goodies :)
The only thing that bothers me a little is the speed of building the
index. I tried with around 5000 messages and I am not quite thrilled;
it's not _extremely_ slow, but it has to be faster for what I need.
Perhaps I'll use the C++ version with some Python bindings.
Cheers,
Andrei
Yes, I agree. The messages aren't very large (under 5K each, although
I've seen one or two of 500K) but there are lots of words.
Turns out there is a much better solution than what I could have
written, and it's named Lupy.
Andrei
Why not do some profiling first? Maybe it's limited by i/o traffic
rather than cpu cycles. I don't know how Lupy works but the one time
I messed with full text indexing, the bottleneck was definitely the
random disk accesses needed for every word of each update. The
solution is to batch the updates. Sorting is much less seek intensive
than random updates.
Yea, I hear that. Work is being done on speeding it up (pretty much the
only development on it now is optimization). I don't know how it will end
up, but things look promising so far. On the other hand, if you don't want
to wait for that to be finished...
Jp
Sounds like the system you messed with was poorly designed.
Even if it was limited by i/o traffic, I am not sure how to do
profiling to find out if this is the case (although I do have my
doubts) - any suggestions how?
Andrei
Well - that depends. If there is a faster version of Lupy by the time
I really need it, in 1-2 months, then I will use that. Otherwise, if I
can find a faster equivalent, I will use that. Just being practical.
Andrei
Well - I want to use this storage to speed up my searches so searching
will really need to be fast.
I didn't want the most compact storage, but 88 bytes to store a
couple of integers is just way too much.
Another reason I wanted to use a database like sqlite is that it
offers some safety guarantees. If something goes wrong during an
update (like no disk space left on the end-user's disk) then the data
still stays safe, i.e. the latest transaction is not committed and
that's it.
I wrote a similar version of what I wanted a couple of years ago, not
using a database, just C/C++ code, something like this (I think a
similar method was suggested in this thread): I was processing
messages in batches, so for every 10000 messages or so I would build
in memory a list of words and the messages in which they occur, and
for each word I would write all messages containing it as consecutive
entries in a file. Then I would update the global word index (kept in
another file), which for each word kept a linked list of the zones in
the other file where the msgIds for that word were.
Worked pretty well, but this was for a server I controlled, i.e. at
all times I was sure I had enough space and I was in control of it.
For an application to be deployed to end-users I need more safety.
That is why I am playing with sql.
> If the word ids and msg ids have fixed maximum sizes you might try zero- or
> space- padding them so your records are of fixed length. You can then store
> the file sorted by wordid and do a binary search in the file looking records
> of interest.
Still would be too slow for a search, and search is the main reason
for this structure.
> Another possibility, assuming your database keys are integers, is to try the
> bsddb recno format database. I've never used it before (not as much call
> for a database file with integer keys), but it seems to be *relatively*
> space-efficient. I believe it doesn't actually need to store the keys,
> since they are implicit.
>
> I built a recno db file containing the pickle of the squares of the numbers
> in range(1, 100000):
>
> >>> for n in range(1,100000):
> ... db1[n] = pickle.dumps(n*n)
> ...
>
> The total string length of the pickled squares was 1.3MB. The file size was
> 2.1MB.
Interesting.
Fortunately I found lupy which suits me, but if I were not I might
have been tempted to experiment with that.
Andrei
The system I messed with wasn't designed for high volumes of updates.
> Even if it was limited by i/o traffic, I am not sure how to do
> profiling to find out if this is the case (although I do have my
> doubts) - any suggestions how?
A crude but maybe useful thing you can do is just observe the CPU load
while the program is running. If the CPU load is low but you hear the
disk banging around, you're i/o bound. As for profiling, there's a
Python profiler documented in the Python library manual, that's the
first thing to try.
There's a good article series on full text indexing at tbray.com which
might interest you, by the way. It was linked from Slashdot a couple
days ago.
>> Even if it was limited by i/o traffic, I am not sure how to do profiling
>
> A crude but maybe useful thing you can do is just observe the CPU load
> while the program is running. If the CPU load is low but you hear the
> disk banging around, you're i/o bound.
I love this one. It would make a great intro quote for the next DrDobbs
weekly summary ! ;-)
--
YAFAP : http://www.multimania.com/fredp/
>I want to implement a fulltext search for messages in a forum. More
>exactly for each message I store pairs (wordId, msgId) for each
>identified word and when I search something I want to be able to
>retrieve very quickly all msgId for a given wordId.
>
>So my problem now is what should I use to store my data and still be
>able to retrieve all messages for a given word very quickly. Either I
>find a workaround in sqlite to make it store values on less bytes, or
>I use something else.
Try the shelve module: it comes with Python; it's like a persistent
dictionary with key=string, value=anything. It uses dbm and pickle
internally.
Instead of storing pairs, you could store:
wordId: (tuple of msgIds)
so your primary search is just like a dictionary lookup: db[str(wordId)]
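A minimal sketch of that scheme (shelve keys must be strings, so the
wordId is converted; the helper names are mine):

```python
import shelve

def add_message(db, msg_id, word_ids):
    """Append msg_id to the tuple stored under each of its words."""
    for w in set(word_ids):           # count each word once per message
        key = str(w)                  # shelve keys must be strings
        db[key] = db.get(key, ()) + (msg_id,)

def messages_for(db, word_id):
    """Return the tuple of msgIds indexed under word_id."""
    return db.get(str(word_id), ())

# usage sketch:
#   with shelve.open("wordindex") as db:
#       add_message(db, 42, [1, 7])
```

Note that rewriting the whole tuple on every update has the same
inefficiency the original poster worried about with (g)dbm.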
Gabriel Genellina
Softlab SRL
I meant I didn't know how to do profiling in this case, to find out
if the program was cpu or disk bound, although if I had sat down and
thought for 5 seconds it would have been obvious.
Profiling in general is a different kettle of fish, and I consider
myself able to spot the places where my programs consume cpu cycles.
Cheers,
Andrei
On Windows, to determine if you are I/O bound, you might try to watch
these two counters:
- Physical Disk: Avg. Disk Queue Length (should be not more than three
times the number of physical disks (spindles) used by your program)
- Physical Disk: Avg. Disk sec/Transfer (should be less than 15-18 ms)
HTH, Karol
For future reference:
If you're like me, and have a lot of c/c++ code that you've built and are
happy with, pyrex ( http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/ )
can come in handy for reusing that code within python. Pyrex is easier to
write and maintain than the direct c-to-python API, and with a little bit
of practice, you can learn how to easily marshal data back and forth
between the two languages. Essentially, it takes all the grunt-work out
of making C/C++ calls from Python. It's especially tasty when you
need to make use of some shared library that python has no interface to.
Sam Walters