Efficient way to structure my data model

5 views
Skip to first unread message

ecognium

unread,
Jun 22, 2009, 3:51:09 AM6/22/09
to Google App Engine
Hi All,

I would like to get your opinion on the best way to structure my
data model.
My app allows the users to filter the entities by four category types
(say A,B,C,D). Each category can have multiple values (for e.g.,
category type A can have values 1,2,3) but the
user can choose only one value per category for filtering. Please
note the values are unique across the category types as well. I could
create four fields corresponding to the four types but it does not
allow me to expand to more categories later easily. Right now, I just
use one list field to store the different values as it is easy to add
more category types later on.

My model (simplified) looks like this:



class Example(db.Model):

categ = db.StringListProperty()

keywords = db.StringListProperty()



The field keywords will have about 10-20 values for each entity. For
the above example, categ will have up to 4 values. Since I allow for
filtering on 4 category types, the index table gets large with
unnecessary values. The filtering logic looks like:
keyword = 'k' AND categ = '1' AND categ = '9' AND categ = '14' AND
categ = '99'

Since there are 4 values in the categ list property, there will be
4^4 rows created in the index table (most of them will never be hit
due to the uniqueness guaranteed by design). Multiply it by the number
of values in the keywords table, the index table gets large very
quickly.

I would like to avoid creating multiple fields if possible because
when I want to make the number of category types to six, I would have
to change the underlying model and all the filtering code. Any
suggestions on how to construct the model such that it will allow for
ease of expansion in category types yet still not create large index
tables? I know there is a Category Property but not sure if it really
provides any specific benefit here.

Thanks!
-e

Nick Johnson (Google)

unread,
Jun 22, 2009, 5:10:29 AM6/22/09
to google-a...@googlegroups.com
Hi ecognium,

If I understand your problem correctly, every entity will have 0-4 entries in the 'categ' list, corresponding to the values for each of 4 categories (eg, Color, Size, Shape, etc)?

The sample query you give, with only equality filters, will be satisfiable using the merge join query planner, which doesn't require custom indexes, so you won't have high indexing overhead. There will simply be one index entry for each item in each list.

If you do need custom indexes, the number of index entries, isn't 4^4, as you suggest, but rather smaller. Assuming you want to be able to query with any number of categories from 0 to 4, you'll need 3 or 4 custom indexes (depending on if the 0-category case requires its own index), and the total number of index entries will be 4C1 + 4C2 + 4C3 + 4C4 = 4 + 6 + 4 + 1 = 15. For 6 categories, the number of entries would be 6 + 15 + 20 + 15 + 6 + 1 = 63, which is still a not-unreasonable number.

-Nick Johnson
--
Nick Johnson, App Engine Developer Programs Engineer
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number: 368047
Message has been deleted

ecognium

unread,
Jun 22, 2009, 8:58:43 PM6/22/09
to Google App Engine
Thanks, Nick. Let me make sure I understand your comment correctly.
Suppose I have the following data:

ID Blob1 Blob2-N Keywords Categ
====================================
123 blah blah tag1,tag2,tag3 Circle,Red, Large, Dotted
345 blah blah tag3,tag4,tag5 Square, Blue, Small, Solid
678 blah blah tag1,tag3,tag4 Circle, Blue, Small, Solid
------------------------------------------------------------------------------

The field categ (list) contains four different types - Shape, Color,
Size and Line Type. Suppose the user wants to retrieve all entities
that are Small Dotted Blue Circles then the query will be:

Select * From MyModel where categ = "Circle" AND categ = "Small" AND
categ = "Blue" AND categ = "Dotted"

When I was reading about exploding indexes the example indicated the
issue was due to Cartesian product of two list elements. I thought the
same will hold true with one list field when used multiple times in a
query. Are you saying the above query will not need {Circle, Red,
Large, Dotted} * {Circle, , , } * {Circle, , , } * {Circle, , , }
number of index entities for entity ID=123? I was getting index errors
when I was using the categ list property four times in my index
specification and that's why I was wondering if I should restructure
things. so I am guessing the following spec should not cause any index
issues in the future?

- kind: MyModel
properties:
- name: categ
- name: categ
- name: categ
- name: categ
- name: keywords
- name: __key__ # used for paging

Thanks,
-e

On Jun 22, 2:10 am, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:

Nick Johnson (Google)

unread,
Jun 23, 2009, 6:53:57 AM6/23/09
to google-a...@googlegroups.com
Hi ecognium,

On Tue, Jun 23, 2009 at 1:35 AM, ecognium <ecog...@gmail.com> wrote:

Thanks, Nick. Let me make sure I understand your comment correctly.
Suppose I have the following data:

ID      BlobProp1       BlobProp2-N     Keywords                        Categ
=================================================
123     blah                    blah                    tag1,tag2,tag3  Circle,

Red,  Large, Dotted
345     blah                    blah                    tag3,tag4,tag5
Square, Blue, Small, Solid
678     blah                    blah                    tag1,tag3,tag4
Circle, Blue, Small, Solid
-------------------------------------------------------------------------------------------------------------


The field categ (list) contains four different types - Shape, Color,
Size and Line Type. Suppose the user wants to retrieve all entities
that are Small Dotted Blue Circles then the query will be:

Select * From MyModel where categ = "Circle" AND categ = "Small" AND
categ = "Blue" AND categ = "Dotted"

When I was reading about exploding indexes the example indicated the
issue was due to Cartesian product of two list elements. I thought the
same will hold true with one list field when used multiple times in a
query.

That is indeed true, though it's not quite the cartesian product - the datastore won't bother indexing (Circle, Circle, Circle, Circle), or (Dotted, Dotted, Dotted, Dotted) - it only indexes every unique combination, which is a substantially smaller number than the cartesian product. It's still only tractable for small lists, though, such as the 4 item lists you're dealing with.

Are you saying the above query will not need {Circle, Red,
Large, Dotted} * {Circle, , , } * {Circle, , , } * {Circle, , , }
number of index entities for entity ID=123?

Correct - if you're not specifying a sort order, you can execute the query without any composite indexes whatsoever. The datastore satisfies equality-only queries using a merge join strategy.
 
I was getting index errors
when I was using the categ list property four times in my index
specification and that's why I was wondering if I should restructure
things.

How many items did you have in the list you were indexing in that case? If your list has 4 items and your index specification lists it 4 times, you should only get one index entry.

so I am guessing the following spec should not cause any index
issues in the future?

Again, that depends on the number of entries in the 'categ' list. With 4 entries, this will only generate a single index entry, but the number of entries will expand exponentially as the list increases in size.

-Nick Johnson



- kind: MyModel
 properties:
 - name: categ
 - name: categ
 - name: categ
 - name: categ
 - name: keywords
 - name: __key__   # used for paging

Thanks,
-e


On Jun 22, 2:10 am, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:

ecognium

unread,
Jun 23, 2009, 1:11:55 PM6/23/09
to Google App Engine
Thanks again - this is very helpful. I will let you know if i run into
any future index creation errors as it could have been caused by any
number of other entries - i mistakenly thought it was all these categ
list-based entries.

So if i understand it right even with a 10 element list for keywords,
there will only be 10 rows when 4 categ fields are used. In the event
I use 'categ' only once in my query along with keywords field, it
will have up to 40 rows (10 from keywords and 4C1 from categ list). Am
I adding these up right?

I do not see myself going beyond 6 elements in the categ list at this
point (I guess the max will be 6C3 = 20 under such a situation). The
keyword list will be probably go into the 20s but do not see anything
beyond that and will always be used only once in the query.

Thanks,
-e

On Jun 23, 3:53 am, "Nick Johnson (Google)" <nick.john...@google.com>

Nick Johnson (Google)

unread,
Jun 24, 2009, 7:19:36 AM6/24/09
to google-a...@googlegroups.com
Hi ecognium,

On Tue, Jun 23, 2009 at 6:11 PM, ecognium <ecog...@gmail.com> wrote:

Thanks again - this is very helpful. I will let you know if i run into
any future index creation errors as it could have been caused by any
number of other entries - i mistakenly thought it was all these categ
list-based entries.

So if i understand it right even with a 10 element list for keywords,
there will only be 10 rows when 4 categ fields are used.

Correct - 4C4 * 10C1 index entries for the custom index you specified earlier.
 
In the event
I use  'categ' only once in my query along with keywords field, it
will have up to 40 rows (10 from keywords and 4C1 from categ list).

Correct.
 
Am
I adding these up right?

I do not see myself going beyond 6 elements in the categ list at this
point (I guess the max will be 6C3 = 20 under such a situation).

In the case of your index with 4 occurrences, this will rather be 6C4.
 

ecognium

unread,
Jun 24, 2009, 9:44:12 AM6/24/09
to Google App Engine
Thanks very much! I think I understand indexing with Lists much better
now...
>> In the case of your index with 4 occurrences, this will rather be 6C4.
Yup, I was looking at the max situation when I use 3 categ fields.
Need to be careful as it grows pretty quickly.


On Jun 24, 4:19 am, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:
> Hi ecognium,
Reply all
Reply to author
Forward
0 new messages