Expando and Index partitioning

18 views
Skip to first unread message

Eli Jones

unread,
Nov 2, 2009, 11:22:44 AM11/2/09
to Google App Engine
Here's something I've been wondering about Expando.

Say you define an Expando model like so:

class meStats(db.Expando):
meNumber = db.IntegerProperty(required=True)


And, then you begin populating it like so:

meEntity1 = meStats(meNumber = 200,
June = 14,
2009 = 6)

meEntity.put()

meEntity2 = meStats(meNumber = 381,
July = 21,
2009 = 7)

meEntity2.put()

..and so on.

The "July" column only has indexes for entities that have "July"
defined.. correct? So, in effect, I am creating a partitioned index
for a table that can grow indefinitely.. and each time I get to a new
year/month combo, I am inserting into new indexes..? (instead of
inserting into an ever increasing, monolithic "Month" column index..)

Mainly, I'm packing the pertinent information into the column names
and column values (instead of making the column name just some dummy
value like "Month").. this allows me to implicitly create the
partitioned table/index (I think of it as a partitioned index since it
is, schematically [as far as I'm concerned], one table.)

You could give the columns better names.. maybe "June_Day" and maybe
"2009_Month" if you wanted...

Does this make sense? Have I misunderstood how Expando handles
indexes?


Another way to word this question would be:

Is there a difference between the indexes created for the June and
July entries in the above Expando model and the below Model models:

class meJune09Stats(db.Model):
meNumber = db.IntegerProperty(required=True)
June = db.IntegerProperty(required=True)
2009 = db.IntegerProperty(required=True)

class meJuly09Stats(db.Model):
meNumber = db.IntegerProperty(required=True)
July = db.IntegerProperty(required=True)
2009 = db.IntegerProperty(required=True)


Thanks for any information.



Tim Hoffman

unread,
Nov 3, 2009, 6:49:52 AM11/3/09
to Google App Engine
Hi

Have you tried this?

For starters you can't assign values to numbers.

ie no matter what you do you can't assign 2009 = 'abc'

You would need to use some other identifier as you mentioned and then
specify something like
year_2009 = db.IntegerProperty(name=2009) or something similiar.

I also see a problem with this strategy with regard to index
definitions.
Whilst running the SDK the indexes will get created as you define data
however once you are running
in real google environment you will need to make sure you have already
defined all possible indexes that you
plan to use before you create any new data (or reindex everything),
which means indexes for all years you plan to hold data for and
search,
and months, and combinations of the two.

I am not sure this is a particularly good approach, but then I am not
sure I get what you are actually doing.

Have you compared the performance of lookups between the two
strategies, also remembering if you are actually interested in year/
month then you are
actually using composite indexes, I wonder if you will ever use the
month only index (apart from comparing months with months for all
years in no particular order)

Rgds

T

Eli Jones

unread,
Nov 3, 2009, 9:26:56 AM11/3/09
to google-a...@googlegroups.com
I haven't done any testing on this yet since I'd have to fill up tens
of gigs of information to see real live performance numbers.

I'm hoping the implicit partitioning makes it so that one doesn't need
manually created indexes (just thedefault ones.)

The example I showed would be a schema for storing a daily int statistic.

The 'June' column entries would show the day of that month and the
'y2009' column would have the 6 value since June is the 6th month of
the year.

If I wanted stats for June, my select would look like this:

Select * From meStats Where y2009 = 6 AND June > 15

This would/should implicitly hit the june rows for 2009 and get the
stats for every day after the 15th.

You could munge around your column names and the values inserted to
get different data reporting behaviour..

The main, potential value is the implicit partitioning (where you
don't need to manually define a bunch of schemas up front).
--
Sent from my mobile device

Tim Hoffman

unread,
Nov 3, 2009, 10:12:25 AM11/3/09
to Google App Engine
Hi

On Nov 3, 10:26 pm, Eli Jones <eli.jo...@gmail.com> wrote:
> I haven't done any testing on this yet since I'd have to fill up tens
> of gigs of information to see real live performance numbers.
>
> I'm hoping the implicit partitioning makes it so that one doesn't need
> manually created indexes (just thedefault ones.)
>
> The example I showed would be a schema for storing a daily int statistic.
>
> The 'June' column entries would show the day of that month and the
> 'y2009' column would have the 6 value since June is the 6th month of
> the year.
>
> If I wanted stats for June, my select would look like this:
>
> Select * From meStats Where y2009 = 6 AND June > 15
>


But the minute you do this ">" you will then need an index that looks
like

- kind: meStats
properties:
- name: y2009
- name: June

and so on for every year month combination where you do a >
comparison.

I think you should have a read about how indexes are created and
accessed before you try optimising something that probably doesn't
need it.

Note the rules from defining index doc
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html#Defining_Indexes_With_Configuration

Other forms of queries require their indexes to be specified in
index.yaml, including:

* queries with multiple sort orders
* queries with a sort order on keys in descending order
* queries with one or more inequality filters on a property and
one or more equality filters over other properties
* queries with inequality filters and ancestor filters

You fall into the third rule. Which as I said eariler will mean you
need to manually specify in index.yaml a massive number of indexes

Rgds

T




> This would/should implicitly hit the june rows for 2009 and get the
> stats for every day after the 15th.
>
> You could munge around your column names and the values inserted to
> get different data reporting behaviour..
>
> The main, potential value is the implicit partitioning (where you
> don't need to manually define a bunch of schemas up front).
>

Eli Jones

unread,
Nov 3, 2009, 4:37:24 PM11/3/09
to Google App Engine
I suggest you watch the IO talk where Brett Slatkin discusses Merge
Joins and pre-computing ranges.

http://www.youtube.com/watch?v=AgaL6NGpkB8

Watch the last half (past 34 min).. and maybe pay attention to the
section that's just after (41 minutes).

This implies you do not need composite indexes (or to create any new
indexes beyond the default ones) for all sorts of queries if you
construct your data in the right way.

I will test this out tonight to provide a proof of concept.



On Nov 3, 10:12 am, Tim Hoffman <zutes...@gmail.com> wrote:
> Hi
>
> On Nov 3, 10:26 pm, Eli Jones <eli.jo...@gmail.com> wrote:
>
>
>
>
>
> > I haven't done any testing on this yet since I'd have to fill up tens
> > of gigs of information to see real live performance numbers.
>
> > I'm hoping the implicit partitioning makes it so that one doesn't need
> > manually created indexes (just thedefault ones.)
>
> > The example I showed would be a schema for storing a daily int statistic.
>
> > The 'June' column entries would show the day of that month and the
> > 'y2009' column would have the 6 value since June is the 6th month of
> > the year.
>
> > If I wanted stats for June, my select would look like this:
>
> > Select * From meStats Where y2009 = 6 AND June > 15
>
> But the minute you do this ">" you will then need an index that looks
> like
>
> - kind: meStats
>   properties:
>   - name: y2009
>   - name: June
>
> and so on for every year month combination where you do a >
> comparison.
>
> I think you should have a read about how indexes are created and
> accessed before you try optimising something that probably doesn't
> need it.
>
> Note the rules from defining index dochttp://code.google.com/appengine/docs/python/datastore/queriesandinde...

Tim Hoffman

unread,
Nov 3, 2009, 5:57:11 PM11/3/09
to Google App Engine
HI Eli

Thats true there are many cases where you don't need composite
indexes, as per the documentation I provided a link to.
However the specific example you gave does. (Now maybe you don't
actually plan to use the specific example you provided
but I don't have anything else to go on)

Now I did actually try it myself before posting \ and the specific
indexes I mentioned get created in index.yaml. And if you run the
dev_server with --require_indexes
and the indexes in question are not present

You get

NeedIndexError: This query requires a composite index that is not
defined. You must update the index.yaml file in your application root.
This query needs this index:
meStats
properties:
- name: y2009
- name: June

Rgds

T

Eli Jones

unread,
Nov 3, 2009, 6:35:01 PM11/3/09
to Google App Engine
yes, I guess the inequality borks it.. I would have to pre-compute day
ranges..

This specific technique would probably be most useful for collecting
hit stats.. where each individual hit was inserted into the Expando
table.

So along with these columns (for june 2009):

Number,June,y2009

You would also have (maybe):
Yahoo,Hour

And it would look like this for an entry from the 15th:

Number,June,y2009,Hour,Yahoo
389, 15, 6, 1400, 1

so.. to get the hit count from Yahoo on June 15th 2009 you would do:

Select * from meStats Where y2009 = 6 AND June = 15 AND Yahoo = 1

Now, I could go all nutty and precompute date ranges to insert along
with the entries as well... but it might not be too much work to just
grab all the June days and pull the ones you wanted.

(This is just the first usage example that comes to mind. This row
naming method could be used for all sorts of set intersection stuff,
and would cut down on insert times due to the fact that it should
partition out the indexes when dealing with humongous datasets).

I already figured out the nutty way I'd use exec to run the variable
put method (tested and works on my dev and live appengine).. I just
hacked this out real fast to see if I could get it to work..
num,monthStr,dayNum etc... are all variables fed into the function
that builds and execs this meStr string (I didn't incorporate the
Yahoo and Hour parts):


meStr = "meEntity = meStats(meNumber = " + str(num) + ", " + monthStr
+ " = " + str(dayNum) + ", " + yearStr + " = " + str(monthNum) + ")
\nmeEntity.put()\n"
exec meStr


Anyway, this was just a side thought I had while wondering what the
point of Expando was.. since it's so unstructured.. I couldn't imagine
why someone would want such an undependable datasource.. but, I can
see this method as being highly useful in a number of cases.. (again,
the main ultimate benefit I see is reducing insert times).

Stephen

unread,
Nov 4, 2009, 9:10:26 AM11/4/09
to Google App Engine


On Nov 3, 11:35 pm, Eli <eli.jo...@gmail.com> wrote:
>
> (This is just the first usage example that comes to mind.  This row
> naming method could be used for all sorts of set intersection stuff,
> and would cut down on insert times due to the fact that it should
> partition out the indexes when dealing with humongous datasets).


I don't think what your proposing is a physical optimisation because
indexes are not discrete objects as they are in a traditional
relational database:

http://code.google.com/appengine/articles/index_building.html#Index%20Layout


Forgetting about physical optimisation and thinking only about
querying slices of the data, how about something like this:

class Stats(db.Model):
number = db.IntegerProperty()
date = db.DateTimeProperty()
dimensions = db.StringListProperty()

e = Stats(number=42,
date=datetime.datetime('1605-11-05'),
dimensions=['1600', 'November', 'Q4', 'd5', 'Saturday'])

Then you could query by date, as in your example, by simply querying
against the date property. But you could also query for all numbers
for any Saturday in the 4th quarter of any year:

q4Saturdays = Stats.all().filter('dimensions =', 'Q4').filter
('dimensions =', 'Saturday')


> Anyway, this was just a side thought I had while wondering what the
> point of Expando was.. since it's so unstructured.. I couldn't imagine
> why someone would want such an undependable datasource..


http://steve-yegge.blogspot.com/2008/10/universal-design-pattern.html

Eli Jones

unread,
Nov 4, 2009, 1:48:14 PM11/4/09
to Google App Engine
Stephen,

My thinking on this is..

Say you have a Month Column.. and it has the default lexically sorted
asc,dsc indexes on it.

My assumption is, when you insert a new row with a Month value
defined, the entire index will have to be updated (no matter how many
shards or tablets it's broken up into).

This is supported by this quote from the Index Building document you
linked to:

"When an entity is created or deleted, all of the index rows for that
entity must be updated."

So.. when you roll around to the June 2010 time period and start
inserting rows for that.. if you have a generic Month column, all the
indexes for every month for every year will have to be updated.

Now, if you had them partitioned out in the way I described,
presumably, you would just need to rebuild the indexes for entries
with a June column.

Granted this isn't most optimal, why not just go all out and do
"June2009", "June2010" columns (and still have the "y2009","y2010"
columns too for quickly grabbing yearly data).. that way.. once the
month of the year is past, those indexes would never need to be
rebuilt again.. yet, since you were using the Expando model, you
wouldn't need seperately defined Models for each MonthYear combo.

Mainly, I see this method as a way of helping BigTable out in
understanding how to partition out my data...

Does this make sense?

I think your ListProperty idea sounds efficient to implement, but I
think it would run into that index updating issue once you got into
the 100s of millions and billions of rows.. every new insert would
require the dimensions and date columns to be rebuilt. Now, these are
my assumptions.. which are fraught with peril... so, I'm trying to
post it here to see if anyone else out there is of a mind to think it
through with me.

Thanks for any input.


On Nov 4, 9:10 am, Stephen <sdea...@gmail.com> wrote:
> On Nov 3, 11:35 pm, Eli <eli.jo...@gmail.com> wrote:
>
>
>
> > (This is just the first usage example that comes to mind.  This row
> > naming method could be used for all sorts of set intersection stuff,
> > and would cut down on insert times due to the fact that it should
> > partition out the indexes when dealing with humongous datasets).
>
> I don't think what your proposing is a physical optimisation because
> indexes are not discrete objects as they are in a traditional
> relational database:
>
> http://code.google.com/appengine/articles/index_building.html#Index%2...

Stephen

unread,
Nov 4, 2009, 4:08:28 PM11/4/09
to Google App Engine


On Nov 4, 6:48 pm, Eli <eli.jo...@gmail.com> wrote:
> Stephen,
>
> My thinking on this is..
>
> Say you have a Month Column.. and it has the default lexically sorted
> asc,dsc indexes on it.
>
> My assumption is, when you insert a new row with a Month value
> defined, the entire index will have to be updated (no matter how many
> shards or tablets it's broken up into).


No. One new entry will be added to the ascending index bigtable, and
one new row will be added to the descending index bigtable.


> This is supported by this quote from the Index Building document you
> linked to:
>
> "When an entity is created or deleted, all of the index rows for that
> entity must be updated."


- By 'entity' they mean a single instance of an etity, i.e. a row, not
a 'type'.

- And by 'all indexes' they mean index entries for all attributes of
that single entity, i.e. an entity with two integer attributes would
have four index entries created, 1 asc and 1 desc for each attribute.

They are just trying to draw a distinction between when an entity
which is created or deleted, and when it is updated, which they cover
in the next sentence:

"However, when an existing entity is updated, the datastore determines
the delta, i.e. the properties that were changed, and only updates the
index rows that include those properties."

I guess this optimisation is worth mentioning in the docs because you
can't do an SQL-like:

update Entity set attr_1 = 42 where key = 99.

...you can only overwrite all attributes, and so you may expect this
to trigger index updates for all attributes. But no, apparently this
is optimised out.

Eli Jones

unread,
Nov 4, 2009, 5:20:19 PM11/4/09
to Google App Engine
I want to get our wordology clear (let me know where I am going
wrong):

A Model in AppEngine is equivalent to a Table in SQL (or, if you like,
it is equivalent to the Table Schema).

An Entity in a Model is equivalent to a Row in a Table. (so.. a Model
is potentially a collection of Entities in the same way a Table is
potentially a collection of Rows)

Properties (of the Entities) in a Model are equivalent to Columns (of
the Rows) in a Table.

Maybe the confusion comes in with the wordage: "all of the index rows
for that entity must be updated"

What is an index row? I am assuming that an "index row" for a certain
property is a row entry in BigTable that contains all index
information for all Entities in the given Model that have that
property defined.

So, there would be an "index row" for the Month Property of your Stats
Model defined above. And, when an Entity was inserted or Deleted for
this Stats Model, the index row for Month would have to be rebuilt. I
am assuming they can't do a diff.. because you have to grow the
index.. there is a new Entity.

If later on you updated a few properties of this newly inserted
entity, then the datastore would know which properties were changed,
and only update the indexes for the changed properties... since the
index space has already been allocated for this new Entity in the
appropriate Index Rows.

I understand that I can be completely wrong on this.. but let me know
where the flaw in my understanding of Entities, Models and Indexes
might be.

Thanks.

Stephen

unread,
Nov 4, 2009, 7:34:13 PM11/4/09
to Google App Engine


On Nov 4, 10:20 pm, Eli <eli.jo...@gmail.com> wrote:
> I want to get our wordology clear (let me know where I am going
> wrong):
>
> A Model in AppEngine is equivalent to a Table in SQL (or, if you like,
> it is equivalent to the Table Schema).
>
> An Entity in a Model is equivalent to a Row in a Table.  (so.. a Model
> is potentially a collection of Entities in the same way a Table is
> potentially a collection of Rows)
>
> Properties (of the Entities) in a Model are equivalent to Columns (of
> the Rows) in a Table.
>
> Maybe the confusion comes in with the wordage:  "all of the index rows
> for that entity must be updated"
>
> What is an index row?  I am assuming that an "index row" for a certain
> property is a row entry in BigTable that contains all index
> information for all Entities in the given Model that have that
> property defined.

No, there is an single index row for each property of each entity.

> So, there would be an "index row" for the Month Property of your Stats
> Model defined above.

Yes.

> And, when an Entity was inserted or Deleted for
> this Stats Model, the index row for Month would have to be rebuilt.

No. The index rows are like the entity rows. Only the rows in the
index which correspond to the properties for that particular entity
need to be inserted/deleted.

> I understand that I can be completely wrong on this.. but let me know
> where the flaw in my understanding of Entities, Models and Indexes
> might be.


I like to think of it something like this:

create table BigTable (
key varchar primary key,
properties blob
);

All the properties are serialised (using Protocol Buffers) into the
blob. This is how you can have entities of the same type with
different properties.

When you put() an entity, the db looks at the properties blob (because
it understands the serialisation format) and if it finds eg. a single
integer property, it inserts a row into the AscendingIndexBigTable,
and another into the DescendingIndexBigTable.

The indexes could look similar to the entity table, except imagine
changing the key key to "eli/E1/a1=42" for the a1 property.

Eli Jones

unread,
Nov 4, 2009, 8:30:45 PM11/4/09
to Google App Engine
Hmm, that's interesting.

So, this suggests that Inserts take the same amount of time.. whether
a Model has 100,000,000 Entities or 100 Entities(presuming the Model
has no extra indexes defined and all Properties of the Model are
required).

Is that really true? If so, I guess it's good and bad. Though, it's
a little annoying since it means I need to do more reading/watching
Datastore info since my understanding is flawed when it comes to Index
structure..

I was watching the "App Engine Datastore Under the Covers"
presentation again.. and I think I understand your description of the
Index Rows now:

http://www.youtube.com/watch?v=tx5gdoNpcZM
Reply all
Reply to author
Forward
0 new messages