I want to implement a fulltext search for messages in a forum. More
exactly for each message I store pairs (wordId, msgId) for each
identified word and when I search something I want to be able to
retrieve very quickly all msgId for a given wordId.
For starters I used PySqLite, using a table with index on wordId. It
works fine but it consumes way too much space on the disk. For example
for 19103 pairs it takes 1686528 bytes, which amounts to around 88
bytes per pair of integers, way too much in my opinion (especially
considering the range of those integers - one is in the range
1..100000 and the other in the range 1..500000).
So my problem now is what should I use to store my data and still be
able to retrieve all messages for a given word very quickly. Either I
find a workaround in sqlite to make it store values on less bytes, or
I use something else.
I looked at the (g)dbm family, but they store one entry per key, and
both key and value must be strings. Using them would mean storing the
list of messages for a word as a string, then each time I find a new
message containing that word, replacing this string with the old
string plus the new message id - and I think this is inefficient.
At the moment I am looking at BerkeleyDB - still checking if this will
suit my needs.
Any advice or hints? I'm looking only for embedded solutions like
PySqLite; if a database needs a server to be installed then it's not
an option.
Thanks in advance,
Andrei
Using any kind of database in a big message system may thrash your
disk pretty badly if you update the database every time a new message
is entered. That means making hundreds of new database entries, one
for each word in the message, each entry needing a disk seek. If
you're adding a few new messages a second, it will be hard for your
disk to keep up.
Anyway, you don't need a real database. You're not doing that kind of
access, you don't need multi-phase transactions, blah blah blah. I
think the way to do it instead is to keep several tables:
1) a static table on disk, simply (wordId, msgId) pairs of 4-byte
binary ints, sorted on wordId. You can also have an index for
it: a table with one entry for each wordId, saying where in
the big table the msgId's for that word start.
2) an update log on disk, (wordId, msgId) pairs like above, but
unsorted.
3) an in-memory table (Python dict) containing the same pairs as
the update log, but using a dict structure for instant lookup.
When a new message comes in, you just append all its (wordId,msgId)
pairs to the update log (just append, no random seeking needed), and
also update the in-memory table. Actually you don't need to store
(wordId,msgId) for every pair. Just append a record that says
"next nnn records are for msgId xxx" and then only store the wordId's.
When a search query comes in, look in the in-memory table by hashing,
and if the word isn't found, look in the static table by binary search.
Then once a night, update the static table by running the update log
through a sort utility and then merging it with the static table.
Then you can delete the update log and in-memory table and begin them
again. The sort-merge phase is relatively fast because sorting
utilities are written to sling around large buffers, minimizing disk
head motion.
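The nightly sort-merge can be sketched like this, assuming for
simplicity that both files hold plain packed (wordId, msgId) pairs of
4-byte unsigned ints (names and layout are mine); the log is sorted
in memory, which is fine because it only holds one day's updates:

```python
import heapq
import struct

REC = struct.Struct("<II")  # one packed (wordId, msgId) pair per record

def read_pairs(path):
    """Yield (wordId, msgId) tuples from a file of packed pairs."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(REC.size)
            if not chunk:
                break
            yield REC.unpack(chunk)

def merge_log_into_static(static_path, log_path, out_path):
    """Sort the day's update log and merge it with the already
    sorted static table into a new static table."""
    log_pairs = sorted(read_pairs(log_path))   # the small side
    with open(out_path, "wb") as out:
        for pair in heapq.merge(read_pairs(static_path), log_pairs):
            out.write(REC.pack(*pair))
```

heapq.merge streams both inputs, so the big static table is read and
written strictly sequentially, which is the whole point of the scheme.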
If the system crashes or you stop the server for some reason, on
restart you can simply read the update log into memory.
You probably don't even need the in-memory structure if you're not
taking too many search queries--the update log should be small enough
that you can just sequentially scan it when a query comes in.
The main idea is that you replace a lot of random small updates with
occasional big sorting operations. The big sorts are far more
i/o-efficient than the small updates. These are the methods they
always used in the old days, when even random-access disk space was
very expensive, so big data sets were always stored on completely
sequential media (magnetic tape).
A pure Python fulltext indexer - http://divmod.org/Lupy/index.html
Jp
Stormbringer> Hi, I want to implement a fulltext search for
Stormbringer> messages in a forum. More exactly for each message I
Stormbringer> store pairs (wordId, msgId) for each identified word
Stormbringer> and when I search something I want to be able to
Stormbringer> retrieve very quickly all msgId for a given wordId.
Stormbringer> For starters I used PySqLite, using a table with
Stormbringer> index on wordId. It works fine but it consumes way
Stormbringer> too much space on the disk. For example for 19103
Stormbringer> pairs it takes 1686528 bytes, which amounts to
Stormbringer> around 88 bytes per pair of integers, way too much
Stormbringer> in my opinion (especially considering the range of
Stormbringer> those integers - one is in the range 1..100000 and
Stormbringer> the other in the range 1..500000).
What about using a binary file of unsigned ints which you load into a
python dictionary and do everything in memory? There would be no
extra overhead in the file and it would be very fast, if you are able
to hold the 100,000 ints in memory.
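A sketch of that idea, assuming the binary file is just packed
machine-size unsigned (wordId, msgId) pairs (the layout and the
function name are mine):

```python
import os
from array import array

def load_index(path):
    """Read a flat binary file of (wordId, msgId) pairs of unsigned
    ints and build a wordId -> [msgId, ...] dict in memory."""
    ints = array("I")
    with open(path, "rb") as f:
        ints.fromfile(f, os.path.getsize(path) // ints.itemsize)
    index = {}
    it = iter(ints)
    for word_id, msg_id in zip(it, it):   # consume the flat list in pairs
        index.setdefault(word_id, []).append(msg_id)
    return index
```

Note that array("I")'s item size is platform-dependent, so the same
array type code should be used when writing the file.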
JDH
No, it's much worse than that. The 100,000 ints are index numbers
for individual words. The 500,000 ints are articles and there can
be thousands of words in each article. So you may need to store
billions of ints, not just 100,000 of them.
There's the rub. PySqLite consumes a lot of disk space presumably to
support efficient access to the data (hashing keys, etc). If you want
compact storage you aren't likely to do much better than
wordid1:msgid1
wordid2:msgid2
...
Unfortunately, searching that will probably be a bit slow.
If the word ids and msg ids have fixed maximum sizes you might try zero- or
space- padding them so your records are of fixed length. You can then store
the file sorted by wordid and do a binary search in the file, looking
for the records of interest.
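A sketch of that fixed-width approach, assuming zero-padded 6-digit
ids in 14-byte records like b"000042:000317\n" and a file kept sorted
by wordid; bisect can binary-search the file in place through a small
sequence wrapper (all names here are mine):

```python
import bisect
import os

REC = 14  # b"wwwwww:mmmmmm\n" -- zero-padded wordid, ':', msgid, newline

class WordKeys:
    """Expose the file's zero-padded wordid fields as a read-only
    sequence so bisect can binary-search the file without loading it."""
    def __init__(self, f):
        self.f = f
        self.n = os.fstat(f.fileno()).st_size // REC
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        self.f.seek(i * REC)
        return self.f.read(6)         # just the wordid field

def msg_ids_for(f, word_id):
    """Return all msgIds stored for word_id in the sorted record file."""
    keys = WordKeys(f)
    key = b"%06d" % word_id
    i = bisect.bisect_left(keys, key)
    out = []
    while i < len(keys) and keys[i] == key:
        f.seek(i * REC + 7)
        out.append(int(f.read(6)))    # the msgid field
        i += 1
    return out
```

Each lookup costs O(log n) seeks for the binary search plus one seek
per matching record, which is why the poster below still found it too
slow for search-heavy workloads.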
Another possibility, assuming your database keys are integers, is to try the
bsddb recno format database. I've never used it before (not as much call
for a database file with integer keys), but it seems to be *relatively*
space-efficient. I believe it doesn't actually need to store the keys,
since they are implicit.
I built a recno db file containing the pickle of the squares of the numbers
in range(1, 100000):
>>> import pickle
>>> from bsddb import rnopen
>>> db1 = rnopen("squares.db")
>>> for n in range(1,100000):
... db1[n] = pickle.dumps(n*n)
...
The total string length of the pickled squares was 1.3MB. The file size was
2.1MB.
Skip
Thanks ! This is exactly what I needed, and the size of the indexes is
around 30%, much much less than what I could have achieved with my
code. Not to mention the fact that I get phrase search and some other
goodies :)
The only thing that bothers me a little is the speed of building the
index. I tried with around 5000 messages and I am not quite thrilled;
it's not _extremely_ slow, but it has to be faster for what I need.
Perhaps I'll use the C++ version with some Python bindings.
Cheers,
Andrei
Yes, I agree. The messages aren't very large (under 5K each, although
I've seen one or two of 500K) but there are lots of words.
Turns out there is a much better solution than what I could have
written, and it's named Lupy.
Andrei
Why not do some profiling first? Maybe it's limited by i/o traffic
rather than cpu cycles. I don't know how Lupy works but the one time
I messed with full text indexing, the bottleneck was definitely the
random disk accesses needed for every word of each update. The
solution is to batch the updates. Sorting is much less seek intensive
than random updates.
Yea, I hear that. Work is being done on speeding it up (pretty much the
only development on it now is optimization). I don't know how it will end
up, but things look promising so far. On the other hand, if you don't want
to wait for that to be finished...
Jp
Sounds like the system you messed with was poorly designed.
Even if it was limited by i/o traffic, I am not sure how to do
profiling to find out if this is the case (although I do have my
doubts) - any suggestions how?
Andrei
Well - that depends. If there is a faster version of Lupy by the time
I really need it, in 1-2 months, then I will use that. Otherwise, if I
can find a faster equivalent, I will use that. Just being practical.
Andrei
Well - I want to use this storage to speed up my searches so searching
will really need to be fast.
I didn't want the most compact storage, but 88 bytes to store a
couple of integers is just way too much.
Another reason I wanted to use a database like sqlite is that it
offers some safety guarantees. If something goes wrong during an
update (like no disk space left on the end-user's disk) then the data
still stays safe, i.e. the latest transaction is not committed and
that's it.
I wrote a similar version of what I wanted a couple of years ago, not
using a database, just C/C++ code, something like this (I think a
similar method was suggested in this thread): I was processing
messages in batches, so for every 10000 messages or so I would build
in memory a list of words and the messages in which they occur, and
for each word I would write all messages containing it as consecutive
entries in a file. Then I would update the global word index (kept in
another file), which for each word kept a linked list of the zones in
the other file where the msgIds for that word were.
Worked pretty well, but this was for a server I controlled, i.e. at
all times I was sure I had enough space and I was in control of it.
For an application to be deployed to end-users I need more safety.
That is why I am playing with sql.
> If the word ids and msg ids have fixed maximum sizes you might try zero- or
> space- padding them so your records are of fixed length. You can then store
> the file sorted by wordid and do a binary search in the file looking records
> of interest.
Still would be too slow for a search, and search is the main reason
for this structure.
> Another possibility, assuming your database keys are integers, is to try the
> bsddb recno format database. I've never used it before (not as much call
> for a database file with integer keys), but it seems to be *relatively*
> space-efficient. I believe it doesn't actually need to store the keys,
> since they are implicit.
>
> I built a recno db file containing the pickle of the squares of the numbers
> in range(1, 100000):
>
> >>> for n in range(1,100000):
> ... db1[n] = pickle.dumps(n*n)
> ...
>
> The total string length of the pickled squares was 1.3MB. The file size was
> 2.1MB.
Interesting.
Fortunately I found lupy which suits me, but if I were not I might
have been tempted to experiment with that.
Andrei
The system I messed with wasn't designed for high volumes of updates.
> Even if it was limited by i/o traffic, I am not sure how to do
> profiling to find out if this is the case (although I do have my
> doubts) - any suggestions how?
A crude but maybe useful thing you can do is just observe the CPU load
while the program is running. If the CPU load is low but you hear the
disk banging around, you're i/o bound. As for profiling, there's a
Python profiler documented in the Python library manual, that's the
first thing to try.
There's a good article series on full text indexing at tbray.com which
might interest you, by the way. It was linked from Slashdot a couple
days ago.
>> Even if it was limited by i/o traffic, I am not sure how to do profiling
>
> A crude but maybe useful thing you can do is just observe the CPU load
> while the program is running. If the CPU load is low but you hear the
> disk banging around, you're i/o bound.
I love this one. It would make a great intro quote for the next DrDobbs
weekly summary ! ;-)
--
YAFAP : http://www.multimania.com/fredp/
>I want to implement a fulltext search for messages in a forum. More
>exactly for each message I store pairs (wordId, msgId) for each
>identified word and when I search something I want to be able to
>retrieve very quickly all msgId for a given wordId.
>
>So my problem now is what should I use to store my data and still be
>able to retrieve all messages for a given word very quickly. Either I
>find a workaround in sqlite to make it store values on less bytes, or
>I use something else.
Try the shelve module: it comes with Python; it's like a persistent
dictionary with key=string, value=anything. It uses dbm and pickle
internally.
Instead of storing pairs, you could store:
wordId: (tuple of msgIds)
so your primary search is just like a dictionary lookup: db[str(wordId)]
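A minimal sketch of that scheme (shelve keys must be strings, so the
wordId is converted; the helper names are mine):

```python
import shelve

def add_message(db, msg_id, word_ids):
    """Append msg_id to the tuple stored under each of its words."""
    for w in set(word_ids):           # count each word once per message
        key = str(w)                  # shelve keys must be strings
        db[key] = db.get(key, ()) + (msg_id,)

def messages_for(db, word_id):
    """Return the tuple of msgIds indexed under word_id."""
    return db.get(str(word_id), ())

# usage sketch:
#   with shelve.open("wordindex") as db:
#       add_message(db, 42, [1, 7])
```

Note that rewriting the whole tuple on every update has the same
inefficiency the original poster worried about with (g)dbm.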
Gabriel Genellina
Softlab SRL
I meant I didn't know how to do profiling in this case, to find out
if the program was cpu or disk bound, although if I had sat down and
thought for 5 seconds it would have been obvious.
Profiling in general is a different kettle of fish, and I consider
myself able to spot the places where my programs consume cpu cycles.
Cheers,
Andrei
On Windows, to determine if you are I/O bound, you might try to watch
these two counters:
- Physical Disk: Avg. Disk Queue Length (should be not more than three
times the number of physical disks (spindles) used by your program)
- Physical Disk: Avg. Disk sec/Transfer (should be less than 15-18 ms)
HTH, Karol
For future reference:
If you're like me, and have a lot of c/c++ code that you've built and are
happy with, pyrex ( http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/ )
can come in handy for reusing that code within python. Pyrex is easier to
write and maintain than the direct c-to-python API, and with a little bit
of practice, you can learn how to easily marshal data back and forth
between the two languages. Essentially, it takes all the grunt-work out
of making C/C++ calls from Python. It's especially tasty when you
need to make use of some shared library that python has no interface to.
Sam Walters