Optimal tree structure with datastore

334 views
Skip to first unread message

Peter Cicman

unread,
Jul 12, 2009, 3:31:12 PM7/12/09
to Google App Engine
I had spend some time today with building tree structure using
datastore. I need a real tree, for real world, so there must be a
possibility to move node with all descendants. There must also be a
quick way to find all ancestors, children and descendants.

So, datastore keys can not be used for tree like this - because key
can not change, and i would like to prevent node copying/deletion,
because this must be always done on node and all its descendants which
may end up with tons of sql queries.

Design i made should fit for me, but maybe there is some common
solution for this problem... Anybody knows about some optimal
solution? (i can not use serialization)

Thanks a lot!

Nick Johnson (Google)

unread,
Jul 13, 2009, 5:17:55 AM7/13/09
to google-a...@googlegroups.com
Hi Peter,

On Sun, Jul 12, 2009 at 8:31 PM, Peter Cicman<pci...@gmail.com> wrote:
>
> I had spend some time today with building tree structure using
> datastore. I need a real tree, for real world, so there must be a
> possibility to move node with all descendants. There must also be a
> quick way to find all ancestors, children and descendants.

I'm not aware of any tree structure that fits both those requirements
- in fact, I'm fairly sure it's not possible.

There are two basic approaches to representing trees: Ones that
encompass the node's position in the entire tree, such as materialized
path and nested set notation, and ones that only encompass the node's
relationship to neighbouring nodes, such as adjacency lists. The
former permit easily querying all descendants, while the latter make
it easy to move entire branches. I don't think it's possible to have
both at once.

If moving a branch is less common than querying, and your trees are
relatively shallow (Say, less than 100 levels deep), I would recommend
using the materialized path approach: Have a ListProperty that lists
all the ancestors of the node.

-Nick Johnson

>
> So, datastore keys can not be used for tree like this - because key
> can not change, and i would like to prevent node copying/deletion,
> because this must be always done on node and all its descendants which
> may end up with tons of sql queries.
>
> Design i made should fit for me, but maybe there is some common
> solution for this problem... Anybody knows about some optimal
> solution? (i can not use serialization)
>
> Thanks a lot!
> >
>



--
Nick Johnson, App Engine Developer Programs Engineer
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
Number: 368047

Vince Stross

unread,
Jul 13, 2009, 7:51:06 AM7/13/09
to Google App Engine
I use the following structure:

class Tag (db.Expando):
label = StringProperty()
parent_ = SelfReferenceProperty(collection_name='children')

This is a simplified for illustration purposes, but it works well
enough for us because we just grab all Tag entities and build the tree
in memory with a single simple function. Of course, we will never be
more than 3-4 deep and always have less than 100 tags total.


On Jul 13, 5:17 am, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:
> Hi Peter,

Peter Cicman

unread,
Jul 13, 2009, 1:56:18 PM7/13/09
to Google App Engine
Hi Nick, hi Vince, i did come up with something like this yesterday:


class Tree(db.Expando):
parents = ListProperty()
level = IntegerProperty()
position = IntegerProperty()

properties:
parents: sorted list of all parent keys, sorting order top/down,
so parents[-1] gives object direct parent
level: nested level, for root nodes = 0, for children of root node
= 1, etc..
position: node position in current subtree, 0 on top, n on bottom


Probably it will work, parent, children, descendants selection should
be possible. Its probably very close to solution described by Nick.

----

But anyway, the key paths looks like so nice solution for tree
structure, so i'm still thinking about them. It brings up some
questions:

- is there a possibility to use other property for path than a
key? probably not if i understand the implementation

- moving the node with all descendants is currently be possible
this way: copy node, change key, save it, delete original node. the
question is - if the datastore is represented like a hash, why key
change isn't allowed? What is so difficult on this?

- if the key change really isn't possible on datastore level,
can't there be some helper in datastore API for doing this (change
instance key / change instance keys on node + all descendants,
eventually also delete method, which performs descendants deletion
when parent gets removed)..

Thanks...

Tim Hoffman

unread,
Jul 13, 2009, 8:38:42 PM7/13/09
to Google App Engine
HI

I have been down this path and I ended up using
to property lists

one being a list of keys to children and the second
a list of object names, (that way I can traverse the tree by entity
names).
and each object keeps a referenceproperty to it's parent.
That way if I get some arbitrary object I can walk back up the tree.

I didn't use a back ref (ie parent.children.all()) as a result of the
parent reference as it will be segmented by
kind. Which means you have to have multiple reference_sets (one for
each kind). (Unless you use polymodels)

I also abandoned any notion of using the key path because if a node
changes the key path is set so you need to create a whole new entity
and it's children.

There are a couple of advantages to the approach I have used.

1, moving nodes around isn't expensive (ie remove name and key from
one list an put it in another) all the nodes children don't need to be
touched.
2. A node can be a child of multiple entities (though you would then
need to abandon the parent reference held by a child)
3. explicit ordering of children doesn't require touching children.


The downside of this approach is you can't really use transactions/
entity groups as (well you can but everything will be in a single
entity group which sort of defeats the point of entity groups. So it
is really only usefull of the you are unlikely to have concurrent
mutating of the tree.

I am using this in a simple cms based on repoze.bfg which provides
folder based organisation and traversal of various content types such
as pages, news, images, files, galleries etc...
(www.fishandlily.com.au is a site based on this cms)
I will be releasing the code for it as soon as I clean it up an
document it a bit.

See ya

T

Nick Johnson (Google)

unread,
Jul 14, 2009, 5:47:22 AM7/14/09
to google-a...@googlegroups.com
Hi Peter,

On Mon, Jul 13, 2009 at 6:56 PM, Peter Cicman<pci...@gmail.com> wrote:
>
> Hi Nick, hi Vince, i did come up with something like this yesterday:
>
>
> class Tree(db.Expando):
>    parents = ListProperty()
>    level = IntegerProperty()
>    position = IntegerProperty()
>
> properties:
>    parents: sorted list of all parent keys, sorting order top/down,
> so parents[-1] gives object direct parent
>    level: nested level, for root nodes = 0, for children of root node
> = 1, etc..
>    position: node position in current subtree, 0 on top, n on bottom

Won't level and position always be identical?

>
>
> Probably it will work, parent, children, descendants selection should
> be possible. Its probably very close to solution described by Nick.
>
> ----
>
> But anyway, the key paths looks like so nice solution for tree
> structure, so i'm still thinking about them. It brings up some
> questions:

Key names internally use the same sort of structure as you just
outlined using a ListProperty. Given that you want to move your
objects without renaming them, using parent relationships is not a
good idea here.

>
>    - is there a possibility to use other property for path than a
> key? probably not if i understand the implementation

No, only the key works.

>
>    - moving the node with all descendants is currently be possible
> this way: copy node, change key, save it, delete original node. the
> question is - if the datastore is represented like a hash, why key
> change isn't allowed? What is so difficult on this?

You also need to repeat the operation for all child entities of the object.

The datastore is not stored as a hash table, it's stored as a sorted
list. Key changes aren't permitted because they're identical to a
deletion followed by an insertion - so we expose that fact to the
user.

-Nick Johnson

Peter Cicman

unread,
Jul 14, 2009, 3:35:11 PM7/14/09
to Google App Engine
Hi Nick,

no, level and position are not identical.

The idea behind it is to use level for node to know how deep in tree
it exists, so root nodes have level=0, node with one parent have
level=1, node with three parents (A-B-C-D) have level=3, etc.. This
should allow simple move in path.. WHERE level ><...

Position will be used just for sorting. It will start in every subtree
with zero value, and will be incremented when new node is added, so it
will be possible to reorder node position in subtree, something like
this should be possible:

BEFORE:
A(0) level = 0
/ | \
B(0) C(1) D(2) level = 1
/ \
E(0) F(1) level = 2


(Note: number in brackets represents position)

AFTER (node B, C reorder):

A(0) level = 0
/ | \
C(0) B(1) D(2) level = 1
/ \
E(0) F(1) level = 2


What do you think about it? There will be probably maximum few
hundreds of levels. Does it fits for datastore?

Do i have to create index on parents property? If yes, size of this
index will probably grow nonlinear with growing amount of levels. And
i have to put also position, level and yet another two or three
properties under index. What is the maximum index size? I did found
just one information about it in documentation - `its huge`, but what
does huge exactly mean?


My hopefully last question :)

Is there somewhere some paper about deeper explanation of datastore?
It would be nice to know how it exactly works in deeper level, how are
the query counters working (i had some problem there also), etc...

Thanks a lot!
Peter.

On Jul 14, 11:47 am, "Nick Johnson (Google)" <nick.john...@google.com>

Nick Johnson (Google)

unread,
Jul 15, 2009, 9:17:55 AM7/15/09
to google-a...@googlegroups.com
On Tue, Jul 14, 2009 at 8:35 PM, Peter Cicman<pci...@gmail.com> wrote:
>
> Hi Nick,
>
> no, level and position are not identical.
>
> The idea behind it is to use level for node to know how deep in tree
> it exists, so root nodes have level=0, node with one parent have
> level=1, node with three parents (A-B-C-D) have level=3, etc.. This
> should allow simple move in path.. WHERE level ><...
>
> Position will be used just for sorting. It will start in every subtree
> with zero value, and will be incremented when new node is added, so it
> will be possible to reorder node position in subtree, something like
> this should be possible:

Ah. The distinction wasn't clear initially. Given that 'position' in
your example appears to apply to all nodes of a given depth, how will
you efficiently calculate and update this? Inserting a new leftmost
leaf node would require renumbering every other node's position.

>
> BEFORE:
>        A(0)            level = 0
>    /    |    \
>  B(0)   C(1)   D(2)     level = 1
>       /   \
>    E(0)   F(1)         level = 2
>
>
> (Note: number in brackets represents position)
>
> AFTER (node B, C reorder):
>
>        A(0)            level = 0
>    /    |    \
>  C(0)  B(1)  D(2)      level = 1
>  /   \
> E(0)  F(1)              level = 2
>
>
> What do you think about it? There will be probably maximum few
> hundreds of levels. Does it fits for datastore?

Yes, though a 'few hundred' is getting to the point where the
ListProperty of ancestors will take a significant amount of time to
update.

>
> Do i have to create index on parents property? If yes, size of this
> index will probably grow nonlinear with growing amount of levels. And
> i have to put also position, level and yet another two or three
> properties under index. What is the maximum index size? I did found
> just one information about it in documentation - `its huge`, but what
> does huge exactly mean?

You only have to index the properties if you want to execute queries
against them that use more than one property and involve an inequality
or a sort order. See the indexing documentation for more details. The
indexing is limited by index entries per entity, not size - and the
limit is in the low thousands.

>
>
> My hopefully last question :)
>
> Is there somewhere some paper about deeper explanation of datastore?
> It would be nice to know how it exactly works in deeper level, how are
> the query counters working (i had some problem there also), etc...

Reading the Bigtable paper would be a good start.

-Nick Johnson

Peter Cicman

unread,
Jul 16, 2009, 3:11:07 PM7/16/09
to Google App Engine
Hi Nick,

i just found my model will not work efficiently. The problem is not
not position update, there will be max few tens of nodes in subtree,
so i just have to update them, and even if enough big distance between
positions will be chosen, lets say 10000, i can just do simple checks
between two nodes, (e.g. average value from positions which are
around) and change the position in my node, this will mean few selects
with 2 entries, and one update.

Eg:
B, C, D are sub nodes of A
pB (position of B) = 1 000 000, because it was the first node
created under B
pC = 1 010 000
pD = 1 020 000

nov - reoder to get D, B, C will with some few queries and checks
result with:
pB = 1 000 000 (unchanged)
pC = 1 010 000 (unchanged)
pA = (1 000 000 - 0) / 2 = 500 000 (position updated) [0 is there
because A is the first node]

This will give me a possibility for quick changed in position. Of
course there may be a conflict if i would like to `move` node between
two nodes (Left, Right) and when pR - pL = 0 (difference between two
possitions). This may lead to update of position of multiple nodes,
but probably only one 2 nodes will be updated. But for this, there may
be cron checking difference between positions, and... But i'm not
expecting hundreds of positional moves anyway...



The problem is somewhere else - lies in parents[] property. If there
will be a move on lets say second level, and moved node have 5 000
descendants, parents property of all descendants must be updated, this
will probably not be fast enough, or on other words, it will be very
expensive.

Hmm, i need something better here...

But, anyway thanks for your help!
Peter.


Of course, there might be a conflict when the

On Jul 15, 3:17 pm, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:
Reply all
Reply to author
Forward
0 new messages