Design Question: How to implement tree structures


gopher

Dec 15, 2009, 12:37:42 PM
to mongodb-user, 20...@benjamin-schweizer.de
Hello there,

I'm currently working on a pet project (wiki.sickos.org/pmon) and I
want to use a tree structure to represent all data. Using Mongo for
this has some advantages, but since it lacks features that traditional
RDBMSs have (like table locking and transactions), it is easy to create
race conditions during simultaneous operations.

I believe this problem is pretty generic and others are facing it,
too. So I hope we can pin down common problems with tree structures
and collect some best practices here.

I'll start with a simple tree collection. Let's assume we have a
collection of nodes, where every node looks like this:

$node = {
    'parent': _id,
    'children': [_id, _id, ...],
    ...
}
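For concreteness, a populated node could look like this (the ids and
the name field are made up):

{
    '_id': 'n2',
    'parent': 'n1',              // id of the parent node
    'children': ['n3', 'n4'],    // ids of the direct children
    'name': 'some payload'       // application data
}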

Now I want to bring up some challenges and possible solutions:

1st Challenge: adding two nodes simultaneously
1st code (shell syntax):
function add(parent, childId) {
    parent = db.nodes.findOne({_id: parent._id})    // get $parent
    parent.children.push(childId)                   // append _id in memory
    db.nodes.save(parent)                           // save $parent (racy write-back)
}

1st trace:
process1 reads $parent
process2 reads $parent
process1 extends $parent.children in memory
process1 stores $parent
process2 extends $parent.children in memory
process2 stores $parent # overwrites process1's update!

1st problem:
we overwrite data that was updated in between, so we lose the link
to one newly created child

1st solution:
I think this should be solved using the atomic $push operation.
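A minimal sketch of that in the shell (untested):

// save the new child first, then atomically append its id to the parent
db.nodes.save(child)
db.nodes.update({_id: parent._id}, {$push: {children: child._id}})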

2nd Challenge: deleting and adding simultaneously
2nd code (shell syntax):
function removeNode(node) {
    node = db.nodes.findOne({_id: node._id})        // get $node
    node.children.forEach(function(childId) {       // for $child in $node.children
        removeNode({_id: childId})                  // delete $child recursively
    })
    db.nodes.remove({_id: node._id})                // delete $node
}
function add(parent, child) {
    parent = db.nodes.findOne({_id: parent._id})    // get $parent
    db.nodes.save(child)                            // save $child
    parent.children.push(child._id)                 // append child id to $parent
    db.nodes.save(parent)                           // save $parent (racy write-back)
}

2nd flow:
process1 reads $parent
process2 reads $parent
process2 saves $child
process2 extends $parent.children in memory
process2 saves $parent
process1 deletes $parent.children recursively
process1 deletes $parent # but misses the newly created child

2nd problem:
Data has changed while we deleted children; we lose the link to one
child

2nd solution:
We could run a conditional delete (adding the list of children as a
criterion). This would prevent the deletion of the parent, but it
would delete the children anyway, so we would be left with broken
references that point to deleted items. It would be better to delete
the parent node first and then walk the children recursively; both
variants are sketched below.
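A sketch of both variants in the shell (a query with a literal array
only matches while the stored children array is still identical):

// conditional delete: refuses to remove the parent if its children changed
db.nodes.remove({_id: parent._id, children: parent.children})

// better ordering: remove the parent first, then walk the orphaned children
db.nodes.remove({_id: parent._id})
parent.children.forEach(function(childId) { removeNode({_id: childId}) })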

However, there's still a problem, see 3rd problem:

3rd flow (same code as the 2nd challenge):
process1 reads $parent
process2 reads $parent
process1 deletes $parent
process1 deletes $parent.children recursively
process2 creates and saves a new $child
process2 extends $parent.children using atomic $push # and fails

3rd problem:
Data has changed although we expected it to still exist

3rd solution:
It's complicated:
- it might be possible to create a new parent, but this depends on the
specific implementation and could be complicated (and fail again).
- we could implement a local rollback and delete the newly created
children again (see the sketch after this list). This is somewhat ugly
in general, because higher levels of the application would expect this
operation to succeed, but in practice I'd give the delete precedence.
Aside from this, it performs poorly, as we possibly do a lot of work
that would not be necessary with traditional locking. Any ideas or
better solutions here?
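A rough sketch of that local rollback, assuming getLastError reports
how many documents the update matched:

db.nodes.save(child)
db.nodes.update({_id: parent._id}, {$push: {children: child._id}})
if (db.getLastErrorObj().n != 1) {       // the parent vanished in the meantime
    db.nodes.remove({_id: child._id})    // roll the new child back out
}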


Greetings
Benjamin

Mathias Stearn

Dec 15, 2009, 1:23:34 PM
to mongod...@googlegroups.com
To handle single-object locking I'd suggest using http://www.mongodb.org/display/DOCS/Atomic+Operations#AtomicOperations-%22UpdateifCurrent%22, and adding a _rev field that is either an incrementing int or a UUID.
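For example, a sketch of that update-if-current pattern with an
integer _rev:

var parent = db.nodes.findOne({_id: parentId})
db.nodes.update(
    {_id: parent._id, _rev: parent._rev},           // matches only if nothing changed
    {$push: {children: childId}, $inc: {_rev: 1}})  // bump the revision on success
// if no document matched, somebody else won the race: re-read and retry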

Your other requests seem to require ACID transactions, which mongo doesn't support. I would suggest using external (to the db) locking to coordinate your processes. You can also use db.eval() to group operations into a pseudo-transaction, but there are some cases where clients could see both new and old data if the queries fall on opposite sides of the db.eval().
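A sketch of such a db.eval() pseudo-transaction (the function runs
server-side):

db.eval(function(parentId, child) {
    // both operations run inside one server-side call
    db.nodes.save(child)
    db.nodes.update({_id: parentId}, {$push: {children: child._id}})
}, parentId, child)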

An alternative would be to do path-based trees, where each node knows the full path from the root to itself. Assuming you have shallow trees you could encode it as {path: "TREE_ID:GRANDPARENT_ID:PARENT_ID:SELF"}. You could then atomically* and efficiently remove a full subtree by doing a regex-based remove like remove({path: /^TREE_ID:GRANDPARENT_ID:PARENT_ID/}).

*Assuming you are not sharded. Right now all single operations are atomic, and in 1.3.x it is likely that all single modifying operations (like remove, update, insert, etc) will be fully atomic at least at the collection level. This isn't a guarantee that it will always be that way, but it will work for now.
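For instance, a sketch of the path-based layout, reusing the
placeholder ids from above:

// each node stores its full root-to-self path
db.nodes.save({_id: "SELF", path: "TREE_ID:GRANDPARENT_ID:PARENT_ID:SELF"})
// one anchored regex fetches a whole subtree...
db.nodes.find({path: /^TREE_ID:GRANDPARENT_ID:PARENT_ID/})
// ...and one removes it
db.nodes.remove({path: /^TREE_ID:GRANDPARENT_ID:PARENT_ID/})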

PS- I took a look at http://monitor.sickos.org/ and I wasn't sure what the nesting in the tree meant. Since there are many ways to do trees in Mongo (I'm working on a wiki page based on http://seancribbs.com/tech/2009/09/28/modeling-a-tree-in-a-document-database/) it will be easier to make recommendations if we know the exact use case.





Benjamin Schweizer

Dec 22, 2009, 4:16:24 AM
to mongodb-user
Hello there,

I've reflected on these challenges and came to some insights; I want
to add them here in case someone else stumbles upon this post.

The main problem with cross references (relations) is that they can
change while a process keeps some state in memory. Traditional
RDBMSs use locking here, which is effective on local single-node
systems. The design of Mongo (and other NoSQL databases), however,
tries to avoid locking, and this results in faults (race conditions)
instead.

To deal with these issues, one should take care with the database
design in general and i) avoid cross references where possible. If
some cross references remain, ii) atomic operations might save the day,
as long as they are executed on a single document. If we still end up
with a design as described above, iii) tracking a _rev (revision) field
and checking it on every request allows the detection of faulty
operations.

From there the question arises: what can we do after we've detected an
inconsistency (fault)? We could either do an explicit rollback and
raise the error, or we could retry the particular operation based on
the modified data. This is what some other complex systems do: iv)
be fault tolerant and retry a rejected operation.

In fact, I think fault tolerance is a good solution. But if there are
a lot of faults, it will degrade performance and might end up in an
infinite loop (so you have to think of that and add timeouts and max-
retry checks); on the other hand, it avoids the massive locking
overhead that would otherwise hit every single request.

(Maybe it would be a good idea to implement some basic retry feature
at the driver level.)
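To illustrate iv), a rough sketch of such a retry loop in the shell,
building on the _rev approach from above (addChildWithRetry and
maxRetries are made-up names):

function addChildWithRetry(parentId, childId, maxRetries) {
    for (var i = 0; i < maxRetries; i++) {            // max-retry check
        var parent = db.nodes.findOne({_id: parentId})
        if (parent == null) return false              // parent is gone, give up
        db.nodes.update(
            {_id: parentId, _rev: parent._rev},       // update-if-current
            {$push: {children: childId}, $inc: {_rev: 1}})
        if (db.getLastErrorObj().n == 1) return true  // matched, success
    }
    return false                                      // too many conflicting writes
}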


Greetings
Benjamin

--
http://benjamin-schweizer.de/contact

