
Is there any library for indexing binary data?


甜瓜

Mar 24, 2010, 11:28:58 PM
to pytho...@python.org
Howdy,

Recently, I have been looking for a good library for building an index on
binary data. The Python bindings for Xapian and Lucene focus on text
indexing rather than binary data. Could anyone give me a recommendation?
Is there any library for indexing binary data, regardless of the language
it is written in?

In my case, there is a very big data table which stores structured
binary records, e.g.:

struct Item
{
    long id;      // used as key
    double value;
};

I want to build an index on the "id" field to speed up searching. Since
this data table is not constant, the library should support incremental
indexing. If there is no suitable library, I will have to build the index
myself...
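
For reference, a rough Python sketch of this record layout, using the
standard struct module (the 8-byte field sizes are my assumption, not
necessarily the real ones):

import struct

# Assumed layout: an 8-byte integer id followed by an 8-byte double, little-endian.
ITEM_FORMAT = "<qd"
ITEM_SIZE = struct.calcsize(ITEM_FORMAT)  # 16 bytes

def pack_item(item_id, value):
    return struct.pack(ITEM_FORMAT, item_id, value)

def unpack_item(raw):
    item_id, value = struct.unpack(ITEM_FORMAT, raw)
    return item_id, value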

Thank you in advance.

--
ShenLei

Gabriel Genellina

Mar 25, 2010, 12:29:07 AM
to pytho...@python.org
On Thu, 25 Mar 2010 00:28:58 -0300, 甜瓜 <littlesw...@gmail.com>
wrote:

What about a database?

--
Gabriel Genellina

甜瓜

Mar 25, 2010, 1:53:45 AM
to Gabriel Genellina, pytho...@python.org
Well, a database is not suitable because 1. the table is very big (~10^9
rows) and 2. we need to support very fast *simple* queries, i.e. getting
the value corresponding to a single key (~10^7 queries / second).

Currently, I have implemented a specific algorithm to deal with my
problem. However, I would like to use a library to simplify the coding;
otherwise I have to write my own code for each big table. It is
possible that, after switching to an indexing library, the program will
not run as fast as the homemade code. But if it greatly simplifies my job
and provides satisfactory speed (e.g. 10^5~10^6 queries / second), the
indexing library is still a good choice for me.

--
ShenLei

2010/3/25 Gabriel Genellina <gags...@yahoo.com.ar>:


Irmen de Jong

Mar 25, 2010, 3:58:28 AM
On 3/25/10 4:28 AM, 甜瓜 wrote:
> Howdy,
>
> Recently, I am finding a good library for build index on binary data.
> Xapian & Lucene for python binding focus on text digestion rather than
> binary data. Could anyone give me some recommendation? Is there any
> library for indexing binary data no matter whether it is written in
> python?
>
> In my case, there is a very big datatable which stores structured
> binary data, eg:
> struct Item
> {
> long id; // used as key
> double value;
> };
>
> I want to build the index on "id" field to speed on searching. Since
> this datatable is not constant, the library should support incremental
> indexing. If there is no suitable library, I have to do the index by
> myself...
>
> Thank you in advance.
>
> --
> ShenLei

Put it into an SQLite database? Or something else from
http://docs.python.org/library/persistence.html.
Or maybe http://www.pytables.org/ is more suitable to your needs (I have
never used that one myself, though).
Or install a bank or two of memory in your box and read everything into
memory in one big hash table.
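
A minimal sketch of the SQLite route, if you go that way (the table and
column names are just for illustration; the PRIMARY KEY gives you a
B-tree index and incremental inserts for free):

import sqlite3

conn = sqlite3.connect("items.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, payload BLOB)")

def insert_item(item_id, payload):
    # Each insert updates the primary-key index automatically.
    conn.execute("INSERT OR REPLACE INTO items (id, payload) VALUES (?, ?)",
                 (item_id, sqlite3.Binary(payload)))
    conn.commit()

def lookup(item_id):
    row = conn.execute("SELECT payload FROM items WHERE id = ?",
                       (item_id,)).fetchone()
    return row[0] if row else None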

Btw, if you already have a big data table in which the data is stored, I'm
guessing it is already in some form of database format. Can't you
write something that understands that format?

But I think you need to provide some more details about your data set.

-irmen

Paul Rubin

Mar 25, 2010, 4:04:23 AM
甜瓜 <littlesw...@gmail.com> writes:
> Well, Database is not proper because 1. the table is very big (~10^9
> rows) 2. we should support very fast *simple* query that is to get
> value corresponding to single key (~10^7 queries / second).

Just one numeric key/value pair in each row? What's wrong with
universal hashing?

PyJudy might also be of interest:
http://www.dalkescientific.com/Python/PyJudy.html
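
For what it's worth, a tiny sketch of one classic universal hash family,
((a*k + b) mod p) mod m, with placeholder sizes (nothing here is specific
to PyJudy):

import random

P = 2**61 - 1            # a prime larger than any key
NUM_BUCKETS = 2**20      # placeholder table size

a = random.randrange(1, P)
b = random.randrange(0, P)

def universal_hash(key):
    # For randomly chosen a, b the collision probability of two distinct
    # keys is about 1/NUM_BUCKETS, independent of the key distribution.
    return ((a * key + b) % P) % NUM_BUCKETS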

甜瓜

Mar 25, 2010, 5:28:25 AM
to pytho...@python.org
Thank you Rubin! Let me have a look at Judy. It seems good at first glance.

--
ShenLei

2010/3/25 Paul Rubin <no.e...@nospam.invalid>:


甜瓜

Mar 25, 2010, 5:55:27 AM
to Irmen de Jong, pytho...@python.org
Thank you Irmen. I will take a look at PyTables.
FYI, let me explain the case more clearly.

Originally, my big data table is simply an array of Item:

struct Item
{
    long id;            // used as key
    BYTE payload[LEN];  // corresponding value, with fixed length
};

All items are stored in one file using the "stdio.h" function:
fwrite(itemarray, sizeof(Item), num_of_items, fp);

Note that the "id" values are unique but in random order. To speed up
searching I regrouped / sorted them into two-level hash tables (in
the form of files). I want to employ some library to help me index
this table.

Since the table contains about 10^9 items and LEN is about 2KB, it is
impossible to hold all the data in memory. Furthermore, new items may
be inserted into the array, so incremental indexing is needed.

Hope this helps you understand my case.
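
To make the layout concrete, here is a rough Python sketch of random
access into such a file (LEN and the 8-byte id are illustrative
assumptions):

import struct

LEN = 2048                            # assumed payload size
ID_FORMAT = "<q"                      # assumed 8-byte little-endian id
ID_SIZE = struct.calcsize(ID_FORMAT)
ITEM_SIZE = ID_SIZE + LEN

def read_item(f, index):
    # Fixed-length records allow seeking straight to the index-th item.
    f.seek(index * ITEM_SIZE)
    raw = f.read(ITEM_SIZE)
    item_id = struct.unpack(ID_FORMAT, raw[:ID_SIZE])[0]
    return item_id, raw[ID_SIZE:]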

--
ShenLei


Irmen de Jong

Mar 25, 2010, 2:55:45 PM
On 25-3-2010 10:55, 甜瓜 wrote:
> Thank you irmen. I will take a look at pytable.
> FYI, let me explain the case clearly.
>
> Originally, my big data table is simply array of Item:
> struct Item
> {
> long id; // used as key
> BYTE payload[LEN]; // corresponding value with fixed length
> };
>
> All items are stored in one file by using "stdio.h" function:
> fwrite(itemarray, sizeof(Item), num_of_items, fp);
>
> Note that "id" is randomly unique without any order. To speed up
> searching I regrouped / sorted them into two-level hash tables (in
> the form of files). I want to employ certain library to help me index
> this table.
>
> Since the table contains about 10^9 items and LEN is about 2KB, it is
> impossible to hold all data in memory. Furthermore, some new item may
> be inserted into the array. Therefore incremental indexing feature is
> needed.

I see, I thought the payload data was small as well. What about this idea:
build a hash table where the keys are the ids from your Item structs and
the values are the file seek offsets of the Item 'records' in your original
data file (although that might produce values of type long, which take
more memory than int, so maybe we should store file_offset/sizeof(Item)
instead). This way you can just keep your original data file (you only
have to scan it once to build the hash table) and you avoid a lengthy
conversion process.

If this hash table still doesn't fit in memory, use a sparse array
implementation of some sort that is more efficient at storing simple
integers, or put it into one of the database solutions mentioned in
earlier responses.

Another thing: I think your requirement of 1e7 lookups per second
is a bit steep for any solution where the dataset is not entirely in core
memory, though.
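
A minimal sketch of that idea (record size and field layout are
assumptions carried over from your earlier post; the index stores record
numbers rather than byte offsets to keep the values small):

import struct

LEN = 2048
ID_FORMAT = "<q"
ID_SIZE = struct.calcsize(ID_FORMAT)
ITEM_SIZE = ID_SIZE + LEN

def build_index(path):
    # One sequential scan of the data file, mapping id -> record number.
    index = {}
    with open(path, "rb") as f:
        recno = 0
        while True:
            raw = f.read(ITEM_SIZE)
            if len(raw) < ITEM_SIZE:
                break
            item_id = struct.unpack(ID_FORMAT, raw[:ID_SIZE])[0]
            index[item_id] = recno
            recno += 1
    return index

def lookup(f, index, item_id):
    recno = index.get(item_id)
    if recno is None:
        return None
    f.seek(recno * ITEM_SIZE)
    return f.read(ITEM_SIZE)[ID_SIZE:]    # the payload bytes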

Irmen.

甜瓜

Mar 25, 2010, 9:45:44 PM
to pytho...@python.org
Many thanks for your kind reply. As you mentioned, a sparse array may
be the best choice.
Storing the offset rather than the payload itself can save a lot of memory.

1e7 queries per second is my ideal aim, but 1e6 must be achieved.
Currently I have reached 5e6 on one PC (without incremental
indexing, and with all incoming queries coming from a local data stream).
Since the table is very big and response time is critical, the final
system will definitely use distributed computing. I hope that the Judy
algorithm can simplify the indexing, so I can focus on implementing
data persistence and the distributed computing side.

--
ShenLei


John Nagle

Mar 25, 2010, 11:28:19 PM
to 甜瓜
甜瓜 wrote:
> Well, Database is not proper because 1. the table is very big (~10^9
> rows) 2. we should support very fast *simple* query that is to get
> value corresponding to single key (~10^7 queries / second).

Ah, crypto rainbow tables.

John Nagle

Albert van der Horst

Apr 4, 2010, 7:28:15 AM
In article <mailman.1173.1269496...@python.org>,
甜瓜 <littlesw...@gmail.com> wrote:
>Well, Database is not proper because 1. the table is very big (~10^9
>rows) 2. we should support very fast *simple* query that is to get
>value corresponding to single key (~10^7 queries / second).
>
>Currently, I have implemented a specific algorithm to deal with my
>problem. However, I want to employ some library to simplify codings,
>otherwise I have to write my own code for each big table. It is
>possible that, after using indexing library, the program cannot run as
>fast as homemade code. But if it can greatly simplify my job and can
>provide satisfied speed (eg 10^5~10^6 queries / second), the indexing
>library is still a good choice for me.

At first sight this looks to me like B-trees would be an ideal
solution. The first levels of the tree are in memory, so with some
luck you have only one or two disk accesses per search.
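
A rough sketch of that layered idea, using a sorted on-disk array of
fixed-size (id, record number) pairs with a small in-memory top level
(block size and field widths are assumptions, and a real B-tree library
would handle updates as well):

import bisect
import struct

ENTRY_FORMAT = "<qq"                    # (id, record number), 8 bytes each
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)
BLOCK_ENTRIES = 512                     # entries per on-disk block

def load_top_level(f):
    # Keep the first id of every block in memory: the "top of the tree".
    top = []
    block = 0
    while True:
        f.seek(block * BLOCK_ENTRIES * ENTRY_SIZE)
        raw = f.read(ENTRY_SIZE)
        if len(raw) < ENTRY_SIZE:
            break
        top.append(struct.unpack(ENTRY_FORMAT, raw)[0])
        block += 1
    return top

def lookup(f, top, item_id):
    # One in-memory bisect picks the block, one disk read fetches it.
    block = bisect.bisect_right(top, item_id) - 1
    if block < 0:
        return None
    f.seek(block * BLOCK_ENTRIES * ENTRY_SIZE)
    raw = f.read(BLOCK_ENTRIES * ENTRY_SIZE)
    entries = [struct.unpack(ENTRY_FORMAT, raw[i:i + ENTRY_SIZE])
               for i in range(0, len(raw), ENTRY_SIZE)]
    ids = [e[0] for e in entries]
    pos = bisect.bisect_left(ids, item_id)
    if pos < len(ids) and ids[pos] == item_id:
        return entries[pos][1]          # record number in the main data file
    return None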

>
>--
>ShenLei
>

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
