New ManyToManyField Approach

7 views
Skip to first unread message

spen...@gmail.com

unread,
Nov 23, 2010, 8:40:54 PM11/23/10
to Django users
I've recently started using the following approach when attempting to
reduce the size of ManyToManyField type reference tables. Currently
I'm working on implementing this in a limited fashion into my own
work. I would like to share what I'm working on with you all and see
if it's worth a reaction at all.

Without test data I'm not sure where the trade offs are with the
following. However it should improve the ability to look up items
that reference a certain set of keys as well as easily check to see if
a set already exists. This should also help reduce the number of rows
in a reference table.

Item 1 references Users 1,2,3,4
Item 2 references Users 1,2,3,4
Item 3 references Users 2,3,4

Item 1 would attempt to see if the following long exists in the
reference table for reference.

"""
>>> long(str(adler32('1|2|3|4')) + "000")
150274623000L
"""

No? Add it.

"""
INSERT INTO "apps_app_reference" VALUES(1,150274623000,1);
INSERT INTO "apps_app_reference" VALUES(2,150274623000,2);
INSERT INTO "apps_app" VALUES(1,...,150274623000)
"""

Item 2 would do the same.. it exists and the set is identical.. so
simply use its reference.

"""
INSERT INTO "apps_app" VALUES(2,...,150274623000)
"""

Item 3 checks to see if it's set exists already in the reference
table.

"""
>>> long(str(adler32('2|3|4')) + "000")
78905746000L
"""

No? Add it.

"""
INSERT INTO "apps_app_reference" VALUES(3,78905746000,1);
INSERT INTO "apps_app_reference" VALUES(4,78905746000,2);
INSERT INTO "apps_app" VALUES(3,...,78905746000)
"""

Adding 000 to the end allows me to use the same adler32 result for
multiple sets that collide. If I find a collision then check for the
adler32 result with 001 at the end and so on until no more collisions
occur, allowing up to 999 collisions which may not be enough for some
situations.. more than enough for mine after exhausting 15 gigs of
memory testing the limits of this approach. In my testing if I assume
that any one item will only reference a maximum of 2**7 different
elements, each element ranging between 1 and 2**32.. then after random
testing and 5 million reference IDs nearly 8000 sets were duplicated
and referenced correctly and the collision offset marker for any hash
maxed out at 003.

I decided to try out using adler32 instead of MD5 or SHA based hashes
since selecting data using numeric indexes as equal or between values
seems faster than selecting an equal or like long string. That and it
reduces storage space if you don't expect the need for as many
distinct reference IDs as MD5 or other cryptographic hashes can offer.

In my own project I needed a quick method of selecting Items that
relate to a certain distinct set. I'm unsure of any extra benefit
anyone else can get out of this method. I really like the existing
way that ManyToMany fields work. Especially since the reference table
does all the referencing leaving base table very simple in layout.

- Shane

Christophe Pettus

unread,
Nov 23, 2010, 10:42:52 PM11/23/10
to django...@googlegroups.com

On Nov 23, 2010, at 5:40 PM, sh...@bogomip.com wrote:
> Without test data I'm not sure where the trade offs are with the
> following. However it should improve the ability to look up items
> that reference a certain set of keys as well as easily check to see if
> a set already exists. This should also help reduce the number of rows
> in a reference table.

It definitely does that. The downsides are:

1. Selects that are looking for a particular individual value ("all items that reference user 1") are going to be a sequential scan.

2. You might get collisions on the adler32 function in real life; I'm not familiar enough with that particular function.

If you are using PostgreSQL, you might look at using intarray or hstore fields instead; those have the same advantages you enumerate, but you can also index on them in ways that will speed up searches for individual values.

--
-- Christophe Pettus
x...@thebuild.com

spen...@gmail.com

unread,
Nov 24, 2010, 12:09:02 AM11/24/10
to Django users
On Nov 23, 6:42 pm, Christophe Pettus <x...@thebuild.com> wrote:
> On Nov 23, 2010, at 5:40 PM, sh...@bogomip.com wrote:
>
> > Without test data I'm not sure where the trade offs are with the
> > following.  However it should improve the ability to look up items
> > that reference a certain set of keys as well as easily check to see if
> > a set already exists.  This should also help reduce the number of rows
> > in a reference table.
>
> It definitely does that.  The downsides are:
>
> 1. Selects that are looking for a particular individual value ("all items that reference user 1") are going to be a sequential scan.

Since you've noticed it I have to ask.. since I didn't think it would
be any different.. how does this differ from the current Django
method?

> 2. You might get collisions on the adler32 function in real life; I'm not familiar enough with that particular function.

Oh.. it's pretty easy to get a collision.. only takes around 1000
iterations of random sets in my case. Thats the point of adding a 0
padded number, starting at 0, to the end of the adler32 result.
Allows for padded length amount of collisions.

> If you are using PostgreSQL, you might look at using intarray or hstore fields instead; those have the same advantages you enumerate, but you can also index on them in ways that will speed up searches for individual values.

Outside of the ORM I'll play around with this.

- Shane

Christophe Pettus

unread,
Nov 24, 2010, 1:28:35 AM11/24/10
to django...@googlegroups.com

On Nov 23, 2010, at 9:09 PM, sh...@bogomip.com wrote:
> Since you've noticed it I have to ask.. since I didn't think it would
> be any different.. how does this differ from the current Django
> method?

The standard Django implementation creates an intermediate table with foreign keys back to the tables on each side, so it's simple to do a query against that intermediate model to handle queries like that. (This is the standard SQL way of handling a many-to-many relationship.)

In fact, in your structure, I'm not sure you can actually answer the question "which items refer to user 1," since that information is lost into the hash. If the only question you ever need to answer is "does item 1 refer to this particular set of users?", then that's not a big deal.

>> If you are using PostgreSQL, you might look at using intarray or hstore fields instead; those have the same advantages you enumerate, but you can also index on them in ways that will speed up searches for individual values.
>
> Outside of the ORM I'll play around with this.

You don't need to abandon the ORM to use hstore or intarray; they adapt very nicely to Pyython types and Django model types.

Reply all
Reply to author
Forward
0 new messages