Representing hierarchies in SQL

14 views
Skip to first unread message

Kiran Jonnalagadda

unread,
Mar 12, 2013, 3:22:11 AM3/12/13
to hasgee...@googlegroups.com
Hey all,

I'm trying to implement a better approach to representing hierarchies in SQL.

The common -- and naïve way -- to do this is what is known as the "adjacency list" model. It's really simple: the database table has a column named 'parent' (or similar) which refers to another row in the same table.

When a row has no parent, it's the root node. To find all the children of a given row, simply find all other rows that reference this one as a parent.

That's it. The simplicity of this model has resulted in its wide use. The adjacency list pattern, however, has a serious flaw: to retrieve the whole tree, you have to recursively visit every row and query for children. The more nested the tree is, the more queries you have to run.

[This, incidentally, is the same problem with nested folders in a filesystem. You can only retrieve the full tree by recursively visiting every folder.]

I first ran into this pattern when implementing threaded comments in Funnel nearly two years ago. I solved the recursion problem by adding a second column that pointed to a "root", so that retrieving the full tree was merely a matter of querying for everything that points to the root.

Here's the implementation in the CommentSpace and Comment models in Funnel: https://github.com/hasgeek/funnel/blob/master/models.py#L123

This clever solution has served us well so far, but it doesn't solve every problem. In particular, there's no way to retrieve a sub-tree. You can only retrieve the full tree from the root level.

Bill Karwin covered the Adjacency List and several other hierarchical patterns in his book SQL Antipatterns. He's also got some material online here: http://www.slideshare.net/billkarwin/models-for-hierarchical-data

The most interesting model there is the Closure Table. It's completely unintuitive at first, but the more I look at it, the more I'm convinced it's brilliant. The closure table requires some additional work during write operations, but makes read operations painless, whereas the adjacency list model optimizes writes over reads.

I've been looking around for implementations to help get started, but they are relatively hard to come by. Most developers seem to be unaware of it. Even Mike Bayer, SQLAlchemy's author, said he wasn't familiar with it (a year ago): https://groups.google.com/d/msg/sqlalchemy/U6wGpO6tl9M/8pyBmZEi2GcJ

Phillip J Eby wrote about it a couple years ago: http://dirtsimple.org/2010/11/simplest-way-to-do-tree-based-queries.html

I want to implement closure tables in Nodular and I'm working on it today.

Why is the ability to retrieve a subtree in a single query important? Because that is how you implement a blog tree. A blog post may be posted to any folder below a given folder, but the feed query should be able to find it in a single request.

Cheap subtree queries are also why the web frameworks world has moved from URL traversal to URL routing as the preferred pattern, even though traversal makes more sense in many situations.

My non-awareness of the closure table pattern was the main reason Eventframe allows a single level of folder hierarchy, and that decision has been a leading cause of pain. It's time to fix it.

This pattern's implementation with SQLAlchemy hasn't been documented anywhere from what I can tell (or in Django for that matter), and because it involves two tables, it requires mucking around in SQLAlchemy internals. I'm hoping to check in a working implementation tonight. Wish me luck.

Kiran

-- 
Kiran Jonnalagadda

Mitesh Ashar

unread,
Mar 12, 2013, 3:30:24 AM3/12/13
to hasgee...@googlegroups.com
Very interesting! Luck - May it be with you!
Sent from BlackBerry® on Airtel

From: Kiran Jonnalagadda <ki...@hasgeek.com>
Date: Tue, 12 Mar 2013 12:52:11 +0530
Subject: [hg-code] Representing hierarchies in SQL
--
You received this message because you are subscribed to the Google Groups "HasGeek Code" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hasgeek-code...@googlegroups.com.
To post to this group, send email to hasgee...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Kiran Jonnalagadda

unread,
Mar 12, 2013, 5:31:04 AM3/12/13
to hasgee...@googlegroups.com
On Tuesday, 12 March 2013 at 1:00 PM, Mitesh Ashar wrote:
Very interesting! Luck - May it be with you!

So I wrote this, packed up and headed to the office, and on the way realized I'm completely wrong. The closure table pattern isn't what we need in Nodular.

Look at this slide comparing the various approaches: http://www.slideshare.net/billkarwin/models-for-hierarchical-data/69

Now here's the situation: In a path like /foo/bar, we have to ensure that the name "foo" is unique within its container (the root node), and that "bar" is likewise unique within "foo". There can't be two nodes with the same name in the same container.

To do this, I have a `parent_id` column in Node, and a unique constraint on (name, parent_id): https://github.com/hasgeek/nodular/blob/master/nodular/node.py#L27

I do not know of any other way to establish this constraint on `name`. Let me know if you know of a way.

What this means is that we have to use the adjacency list pattern regardless of whatever else we use to make subtree queries easier.

The second major problem is URL traversal. If the user wants to go to /foo/bar/baz, we have to establish this path exists, and that takes at least as many queries as there are path segments (ie, O(n+1) efficiency):

1. Find the root node
2. Find 'foo' within root
3. Find 'bar' within 'foo'
4. Find 'baz' within 'bar'

The deeper the tree gets, the more queries we have to run. This is obviously inefficient, and the closure table pattern does nothing to help. It can retrieve the full tree, but not specific paths.

The path enumeration pattern, however, tackles exactly this problem. It stores the full path as a column in the row, so it's trivial to retrieve a path or subtree. The two problems with path enumeration (as per Karwin's slide above) -- querying for immediate children and maintaining referential integrity -- are both already solved by the adjacency list pattern.

The path enumeration pattern does have one problem: if a node is renamed, the paths of everything below it have to be amended recursively. Fortunately for us, renames are relatively rare, so it's okay for them to be inefficient.

I'm going to have to throw out all of yesterday's work and retool for path enumeration. At least this is much easier to implement.

Kiran

Mitesh Ashar

unread,
Mar 12, 2013, 5:53:43 AM3/12/13
to hasgee...@googlegroups.com
In this scenario, would it be a better option to store data in a graph database like neo4j?

-- 
Mitesh Ashar
Sent with Sparrow

--

kracekumar ramaraju

unread,
Mar 12, 2013, 6:10:53 AM3/12/13
to hasgee...@googlegroups.com
On Tue, Mar 12, 2013 at 3:23 PM, Mitesh Ashar <em...@miteshashar.com> wrote:
In this scenario, would it be a better option to store data in a graph database like neo4j?

-- 
Mitesh Ashar
Sent with Sparrow

On Tuesday, 12 March 2013 at 3:01 PM, Kiran Jonnalagadda wrote:

On Tuesday, 12 March 2013 at 1:00 PM, Mitesh Ashar wrote:
Very interesting! Luck - May it be with you!

So I wrote this, packed up and headed to the office, and on the way realized I'm completely wrong. The closure table pattern isn't what we need in Nodular.

Look at this slide comparing the various approaches: http://www.slideshare.net/billkarwin/models-for-hierarchical-data/69

Now here's the situation: In a path like /foo/bar, we have to ensure that the name "foo" is unique within its container (the root node), and that "bar" is likewise unique within "foo". There can't be two nodes with the same name in the same container.

To do this, I have a `parent_id` column in Node, and a unique constraint on (name, parent_id): https://github.com/hasgeek/nodular/blob/master/nodular/node.py#L27

I do not know of any other way to establish this constraint on `name`. Let me know if you know of a way.

What this means is that we have to use the adjacency list pattern regardless of whatever else we use to make subtree queries easier.

The second major problem is URL traversal. If the user wants to go to /foo/bar/baz, we have to establish this path exists, and that takes at least as many queries as there are path segments (ie, O(n+1) efficiency):

1. Find the root node
2. Find 'foo' within root
3. Find 'bar' within 'foo'
4. Find 'baz' within 'bar'

The deeper the tree gets, the more queries we have to run. This is obviously inefficient, and the closure table pattern does nothing to help. It can retrieve the full tree, but not specific paths.

The path enumeration pattern, however, tackles exactly this problem. It stores the full path as a column in the row, so it's trivial to retrieve a path or subtree. The two problems with path enumeration (as per Karwin's slide above) -- querying for immediate children and maintaining referential integrity -- are both already solved by the adjacency list pattern.

A table with 2 column like urlpath, pkeys_as_dict. E.g: /jsfoo/2013/23-video-name, {'channel': 2, 'playlist': 34, 'video': 23} stored in db. Retrival time is O(1) since it is more like hashtable. Is this how path enumeration or traversal work ? if irrelevant pls ignore.    
The path enumeration pattern does have one problem: if a node is renamed, the paths of everything below it have to be amended recursively. Fortunately for us, renames are relatively rare, so it's okay for them to be inefficient.

I'm going to have to throw out all of yesterday's work and retool for path enumeration. At least this is much easier to implement.

Kiran

--
You received this message because you are subscribed to the Google Groups "HasGeek Code" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hasgeek-code...@googlegroups.com.
To post to this group, send email to hasgee...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

--
You received this message because you are subscribed to the Google Groups "HasGeek Code" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hasgeek-code...@googlegroups.com.
To post to this group, send email to hasgee...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Thanks & Regards

"Talk is cheap, show me the code" -- Linus Torvalds
kracekumar

Mitesh Ashar

unread,
Mar 12, 2013, 6:23:56 AM3/12/13
to hasgee...@googlegroups.com

Kiran Jonnalagadda

unread,
Mar 12, 2013, 7:30:28 AM3/12/13
to hasgee...@googlegroups.com
On Tuesday, 12 March 2013 at 3:40 PM, kracekumar ramaraju wrote:
A table with 2 column like urlpath, pkeys_as_dict. E.g: /jsfoo/2013/23-video-name, {'channel': 2, 'playlist': 34, 'video': 23} stored in db. Retrival time is O(1) since it is more like hashtable. Is this how path enumeration or traversal work ? if irrelevant pls ignore.    
Krace, I'm having trouble finding your text among all the quotes. Can you please trim unnecessary quotes?

The path enumeration model is for a self-referencing hierarchy with a single table, but your example has three different models, so it's not the same thing.

Kiran

Kiran Jonnalagadda

unread,
Mar 12, 2013, 7:38:09 AM3/12/13
to hasgee...@googlegroups.com
On Tuesday, 12 March 2013 at 3:23 PM, Mitesh Ashar wrote:
In this scenario, would it be a better option to store data in a graph database like neo4j?
I'm interested in exploring new database architectures in a limited scope, but migrating entire projects is not an option because (a) there are too many unknowns and (b) we don't have the time and resources to experiment.

Version 1 can be ugly and incomplete, but should work, thereby allowing the operations teams to continue their work and freeing up the tech teams to dick around attempting the perfect version 2.

ContactPoint had an inefficient version 1 that worked before we replaced polling with websockets.

Eventframe is similarly a working version 1, but it's far too incomplete, so Nodular is under pressure to replace it fast.

Nodular version 2 is for the experiments with graph databases. We'll get there.

Kiran

Mitesh Ashar

unread,
Mar 12, 2013, 7:51:30 AM3/12/13
to hasgee...@googlegroups.com
Got that.

-- 
Mitesh Ashar
Sent with Sparrow

--

Kiran Jonnalagadda

unread,
Mar 12, 2013, 4:29:38 PM3/12/13
to hasgee...@googlegroups.com
On Tuesday, 12 March 2013 at 3:01 PM, Kiran Jonnalagadda wrote:
The path enumeration pattern, however, tackles exactly this problem. It stores the full path as a column in the row, so it's trivial to retrieve a path or subtree. The two problems with path enumeration (as per Karwin's slide above) -- querying for immediate children and maintaining referential integrity -- are both already solved by the adjacency list pattern.

The path enumeration pattern does have one problem: if a node is renamed, the paths of everything below it have to be amended recursively. Fortunately for us, renames are relatively rare, so it's okay for them to be inefficient.
Path enumeration with recursive path amendment is now in the repo with tests. http://git.io/ViZwQw

Kiran


Kiran Jonnalagadda

unread,
Mar 18, 2013, 1:18:32 PM3/18/13
to hasgee...@googlegroups.com
On Wednesday, 13 March 2013 at 1:59 AM, Kiran Jonnalagadda wrote:
Path enumeration with recursive path amendment is now in the repo with tests. http://git.io/ViZwQw

I just checked in initial support for content with revision history: http://git.io/zI6nrw

Here is the typical use case: when a sponsor signs up for a conference, we ask them to send us a write-up. Someone from the event team becomes responsible for posting this write-up to the website. This is either posted to Eventframe or to a static HTML file as part of the site theme.

The manual workflow is grossly inefficient.

In the new approach, the sponsor can edit their write-up directly in Eventframe, confirm that they like what they are seeing, and then "submit" for "approval" by an event manager (so we can ensure they are not violating the terms of the sponsorship in the text).

Eventframe already has revision history for content, but that was meant for making Undo possible. It had no access control or workflow process, and only the textual content component had revision history.

In the new approach, each revision can have a unique label attached to it, indicating a status of "draft", "pending", "published" or none (for "archived"). There can only be one revision in each of the draft/pending/published states.

Draft state is for content owners to edit. Pending state is a draft submitted for publication. Published state is what regular site visitors see. Archived state is for old revisions that are no longer relevant but haven't been purged from the database.

The new approach makes it possible for all columns in a Node to be revision controlled. This means that for a theoretical "Event" node, even the dates are revision controlled and can be put through a workflow process.

This approach makes it possible to build a rich wiki like CrunchBase where the content has well-defined structure and isn't plain text, and yet has full revision history and workflow.


Part 2: i18n

I've been conflicted for a long time on whether i18n should be part of the design. Should it be possible to support multiple language revisions of a given document?

1. Consider the typical open source project: the app is initially written in English, and with each release a translation team updates the local language version of each English string.

2. Consider Wikipedia: MediaWiki doesn't support multiple versions of the same document, and they figured each language came with its own culture and community of knowledge sharing, so they have a separate instance for each language and inter-language links between pages.

3. Consider GitHub: they figured i18n was a pain and English was the universal language of computer programming, so they gave up on i18n altogether. GitHub is only available in English.

At HasGeek, everything we've done has been in English so far. We've never needed another language. I do not imagine us needing to do a Kannada or Hindi language website anytime.

And yet, our friends in Europe do need to make bilingual websites all the time, and we're trying to make general purpose software. Should we shoehorn i18n into the revision control mechanism?

Today's check-in includes both "language" and "status" columns, with language defaulting to a blank string (indicating language-agnostic content). The unique constraint on "status" covers both language and status. That is, draft/pending/published status tracking is separate per language.

We do not have to do this. Each language edition of a website could simply be another folder, with /en, /de prefixes in the URL, each maintained independently. However, by making i18n a basic feature of nodes, we're making it possible to define a node hierarchy only once and re-use it for every language. I remember this particularly from Plone, where i18n for the basic UI was in the filesytem and was handled incrementally, but i18n for content was an all-or-nothing choice -- you had to commit to redoing the entire website in another language.

Is i18n a fringe benefit that is not worth the pain of addressing every other issue to make it work?

Kiran

PS: For what it's worth, non-revisioned nodes like Folders have no language, so their names and other content are in the user's default language and cannot be customized per language. Only revisioned nodes have language.

Kiran Jonnalagadda

unread,
Mar 19, 2013, 6:56:33 AM3/19/13
to hasgee...@googlegroups.com
Guys, I'll appreciate some feedback on i18n. I'm currently blocked on this design decision.

Kiran

-- 
Kiran Jonnalagadda

Mitesh Ashar

unread,
Mar 19, 2013, 7:07:06 AM3/19/13
to hasgee...@googlegroups.com
According to me it is important to first get the status aspect right by trial and error.

Apparently if i18n is implemented, it will have its own implications and its own set of trials & errors. Since both are a part of the same project, it is always preferable to halt i18n to avoid complications. That also helps by buying in more time to determine whether it is needed or not.

-- 
Mitesh Ashar
Sent with Sparrow

Anand Chitipothu

unread,
Mar 19, 2013, 7:22:12 AM3/19/13
to hasgee...@googlegroups.com
There are three approaches people use for i18n.

1. One subdomain for each language. a la wikipedia. Each site
typically contains different content.
2. One subdir for each language. Many CMS websites use this way. It is
hard to keep content in sync.
3. Multiple languages supported in the same website.

The last one is preferred and a bit tricky to implement. It should
pick the language from the request headers and display the content in
appropriate language.

There are two things that can be translated.
1. The strings used in the application/website
2. The content

The first one is easy to solve. There are tools to extract strings
used in templates and all we need to do is take the template file and
add strings for the target language. Flask-Babel does it for Flask.

Things get tricky when you want to translate the content.

At Open Library, we had pages with /about.en and /about.it etc. and
/about page picks the right document (node in your case) and displays
the content.

Freebase does it differently. Each field in the document/node can take
values in multiple languages.

You can pick whatever works better for you. I think keeping multiple
translations in the same field might create an issue for translators.
You may want to allow permissions to different people for editing
content of each language. I think, keeping each language content in
separate node helps in keeping your permission system simple.

Anand
> --
> You received this message because you are subscribed to the Google Groups
> "HasGeek Code" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to hasgeek-code...@googlegroups.com.
> To post to this group, send email to hasgee...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
Anand
http://anandology.com/

Devi

unread,
Mar 19, 2013, 8:33:32 AM3/19/13
to HasGeek Code
I second Mitesh's opinion that it's better to wait on i18n till we get
everything else in place. As long as the "status" takes care of
uniqueness on language+status, I think we're good. We could just
proceed with the language as empty string as it is now.


On Mar 18, 10:18 pm, Kiran Jonnalagadda <ki...@hasgeek.com> wrote:
> On Wednesday, 13 March 2013 at 1:59 AM, Kiran Jonnalagadda wrote:
> > Path enumeration with recursive path amendment is now in the repo with tests.http://git.io/ViZwQw

Kiran Jonnalagadda

unread,
Mar 19, 2013, 3:05:39 PM3/19/13
to hasgee...@googlegroups.com
On Tuesday, 19 March 2013 at 4:52 PM, Anand Chitipothu wrote:
At Open Library, we had pages with /about.en and /about.it etc. and
/about page picks the right document (node in your case) and displays
the content.

I really like this idea. It's clean and it keeps the language-specific versions distinct. I think I'll go with it and remove language awareness from the revision level, bumping it up to the node publisher.

Kiran
Reply all
Reply to author
Forward
0 new messages