Nested Sets vs. Nested Intervals

asdf

unread,

Nov 7, 2005, 3:43:28 PM11/7/05

to

Hi!,

I am building a web directory similar to the dmoz and the yahoo
directory. The categories get updated often.

How do I find the subcategories from just the names of the ancester and
the current categories? For example, both the dmoz and yahoo
directories have filesystem-like URLs instead of category IDs. How do I
do that with nested sets or nested intervals? If not, how do they do
it?

There should be a count for how many listings in each subcategory. It
is much more efficient to count how many listings are in each
subcategory with nested sets than the adjacency list. Is there a way to
count subcategories efficently with nested intervals with farey
fractions?

Thank you very much!

Message has been deleted

Vadim Tropashko

unread,

Nov 7, 2005, 6:07:27 PM11/7/05

to

Subcategories can be counted "semi-efficiently", that is you have to
find all the intervals (or just all the left boundary points) that are
contained within the ancestor interval. The speed is proportional to
the size of the subtree.

Nested sets is often quoted as a solution for static hierarchies. It is
rarely mentioned that one can't query ancestors path efficiently.
Admittedly, nested sets counts the descendants nicely as rgt-lft-1, but
"count" is the only hierarchical total aggregate function that nested
sets can do efficiently.

The implementation of ancestor query in nested intervals encoding is
somewhat messy. If we ignore the efficiency, then

select count(1)
from tree t
where compare(rational(a,b), lft) > 0 and
compare(rgt(lft),rational(c,d)) >= 0

should be good enough. The [a/b, c/d] is the encoding of your node,
below which you count the descendants. This is full table scan of the
tree table. This could be improved by adding explicit predicate to turn
the access path to index range scan

select count(1)
from tree t
where compare(rational(a,b), lft) > 0 and
compare(rgt(lft),rational(c,d)) >= 0
and t.lft.num/t.lft.den between
a/b-0.00000001 and c/d+0.00000001

where the magic constant 0.00000001 is designed to offset floating
point numerical imprecision.

-CELKO-

unread,

Nov 8, 2005, 3:03:13 PM11/8/05

to

>> Is there a way to count subcategories efficently with nested intervals with farey fractions? <<

In the nested sets model, (rgt -lft +1 )/2 = size of subtree at this
root node.

amado...@netcabo.pt

unread,

Nov 8, 2005, 4:23:38 PM11/8/05

to

Sorry, please correct my interpretation of your problem, as it sounds
too simple to be true: you have a tree (of categories); given a node (a
category), you want to know its immediate children.

amado...@netcabo.pt

unread,

Nov 9, 2005, 6:06:48 AM11/9/05

to

Oops... you want to count the descendants. Sorry, browser problem. Ok,
to count the descendants, I see you've already got good answers for SQL
systems. Personally I would use a non SQL system, but that's another
story.

/* Namely the story of graph-based vs. relational systems. "Relational
atabases are universally conceived of as an advance over their
predecessors network and hierarchical models." (Vadim on dbazine.com)
This is probably true in the historical context of Ingres vs. Codasyl
etc. but mostly a myth. Unfortunately Codasyl's demise is (incorrectly)
perceived as a weakness of *all* graph-based data systems. (But why are
file systems still hierarchical today?) */

Roy Hann

unread,

Nov 9, 2005, 7:31:53 AM11/9/05

to

<amado...@netcabo.pt> wrote in message
news:1131534408.8...@z14g2000cwz.googlegroups.com...

> (But why are
> file systems still hierarchical today?) */

Why indeed! It doesn't actually make a lot of sense. It is great for some
needs but by no means all. A more versatile organization which could be
viewed as a hierarchy (when that is useful) seems long past due.

Roy

vc

unread,

Nov 9, 2005, 8:04:28 AM11/9/05

to

Is it another case of a "read a book or something" ? Codd's "A
Relational Model of Data for Large Shared Databanks" may be a good
start in order to understand why "graph-based systems" are not a good
approach to managing data.

amado...@netcabo.pt

unread,

Nov 9, 2005, 8:34:34 AM11/9/05

to

<<Is it another case of a "read a book or something" ?>>

No.

<<"graph-based systems" are not a good approach to managing data>>

Repeating a myth does not make it true. I'll gladly discuss any
technical issue.

vc

unread,

Nov 9, 2005, 10:02:04 AM11/9/05

to

Calling Codd's work a myth is cute, but can you offer any technical
argument in favour of "graph-based systems" in addtion to what has been
rehashed on this newsgroup many times and besides "I say so" ?

amado...@netcabo.pt

unread,

Nov 9, 2005, 12:23:44 PM11/9/05

to

> Calling Codd's work a myth is cute, but can you offer any technical
> argument in favour of "graph-based systems" in addtion to what has been
> rehashed on this newsgroup many times and besides "I say so" ?

I don't recall the arguments in Codd's original article of 1970 or so.
Maybe I'll take a relook next time I stop by the University library.
Anyway the following holds.

Some semantic domains are better modelled with graphs and trees than
with relations. The idioms for trees and graphs in SQL is extremely
convoluted to say the least (the word "pathetic" comes to mind). The
original problem in this thread is an concrete example.

Relational star people themselves recognized shortcomings of the
relational theory for semantic modelling. This was a major factor in
the creation of the Entity-Relationship model. And guess what: the E-R
is a graph-based model. (Not to mention UML class diagrams, but I know
what you'll say: these are for programs, not data.)

There's graph-based and graph-based. As you I also don't subscribe to
the socalled "hierarchical" and "network" models as they are often
presented in relational circles. They seem to be a heritage of Codasyl
and alike systems of the 1960's. Fortunately others kept researching
and have come up with much better graph-based models e.g. Resource
Description Framework and even XML.

Mikito Harakiri

unread,

Nov 9, 2005, 1:09:59 PM11/9/05

to

amado...@netcabo.pt wrote:
> Some semantic domains are better modelled with graphs and trees than
> with relations. The idioms for trees and graphs in SQL is extremely
> convoluted to say the least (the word "pathetic" comes to mind). The
> original problem in this thread is an concrete example.

This is because the domain is poorly understood. I would challenge the
idea that XML people have clearer understang of graph problems. Tags
don't magically solve problems.

> Relational star people themselves recognized shortcomings of the
> relational theory for semantic modelling. This was a major factor in
> the creation of the Entity-Relationship model. And guess what: the E-R
> is a graph-based model. (Not to mention UML class diagrams, but I know
> what you'll say: these are for programs, not data.)

Well, the join operation is another binary relation that can be
interpreted as edges in a graph. Excuse me, but that is not a
particularly deep idea. I fail to see it utilized anywhere except query
optimization.

> There's graph-based and graph-based. As you I also don't subscribe to
> the socalled "hierarchical" and "network" models as they are often
> presented in relational circles. They seem to be a heritage of Codasyl
> and alike systems of the 1960's. Fortunately others kept researching
> and have come up with much better graph-based models e.g. Resource
> Description Framework and even XML.

It is not a matter of "do hierarchical and network models make sence".
It's a matter of how nice is the underlying theory. So far relational
people seems to be unimpressed.

vc

unread,

Nov 9, 2005, 1:20:30 PM11/9/05

to

amado.al...@netcabo.pt wrote:
> > Calling Codd's work a myth is cute, but can you offer any technical
> > argument in favour of "graph-based systems" in addtion to what has been
> > rehashed on this newsgroup many times and besides "I say so" ?
>
> I don't recall the arguments in Codd's original article of 1970 or so.
> Maybe I'll take a relook next time I stop by the University library.
> Anyway the following holds.
>
> Some semantic domains are better modelled with graphs and trees than
> with relations.

Before going any farther, let's get our definitions straight in order
to prevent mutual misunderstanding. Please define the "semantic
domain".

>The idioms for trees and graphs in SQL is extremely
> convoluted to say the least (the word "pathetic" comes to mind).

The tools for treating tree-like structures in RDBs are pretty
straightforward, there is nothing "convoluted" or "pathetic" about
them. In addition to what has already been suggested, every major
database has some recursive query implementation that lets the OP solve
his problem easily.

>The
> original problem in this thread is an concrete example.

See above.

>
> Relational star people themselves recognized shortcomings of the
> relational theory for semantic modelling.

Who are "relational star people" ? Also, please define "semantic
modelling" .

> This was a major factor in
> the creation of the Entity-Relationship model. And guess what: the E-R
> is a graph-based model.

So what ? Chen's E-R model is *not* the relational model . E-R models
can be mapped to relational models or non-relational models.

>(Not to mention UML class diagrams, but I know
> what you'll say: these are for programs, not data.)

Let's not mention UML ;)

>
> There's graph-based and graph-based. As you I also don't subscribe to
> the socalled "hierarchical" and "network" models as they are often
> presented in relational circles.

If you do not subscribe to, say, the network model, then you can
probably spell out how the/(a) graph-based model is different from the
network model.

>They seem to be a heritage of Codasyl
> and alike systems of the 1960's. Fortunately others kept researching
> and have come up with much better graph-based models e.g. Resource
> Description Framework and even XML.

How is RDF or XML different from the network model ?

asdf

unread,

Nov 9, 2005, 4:01:46 PM11/9/05

to

Thank you very much, Vadim Tropashko.

Nested Intervals are very efficient for updating and selecting
hierarchical categories. I am creating a web application/directory from
it, but it would be more convenient for the user to retrieve its
children from an array of folder names in the URL (seperated by dots or
slashes, like google groups does for its usenet categories) instead of
a category_id.

Both dmoz and yahoo also retrieves its immediate subcategories by the
array of folder-names in the URL; Unlike some other directories that
retrives its immediate subcategories from a category_id in the URL. Is
there any efficient way to retrieve its immediate subcategories by an
array of folder-names such as dmoz and yahoo?

Basically, it's like replicating a file system: an array of folder
names can point to the category_id to retrieve their children. However,
it would take recursive SELECTs to find the category_id from the array
of folder names. I could store the whole path in each row, but it would
be very inefficient to update a category. Is there an efficient way to
do this?

Thank you very much!

vc

unread,

Nov 9, 2005, 4:08:02 PM11/9/05

to

asdf wrote:
> Thank you very much, Vadim Tropashko.
>
> Nested Intervals are very efficient for updating and selecting
> hierarchical categories. I am creating a web application/directory from
> it, but it would be more convenient for the user to retrieve its
> children from an array of folder names in the URL (seperated by dots or
> slashes, like google groups does for its usenet categories) instead of
> a category_id.
>
> Both dmoz and yahoo also retrieves its immediate subcategories by the
> array of folder-names in the URL; Unlike some other directories that
> retrives its immediate subcategories from a category_id in the URL. Is
> there any efficient way to retrieve its immediate subcategories by an
> array of folder-names such as dmoz and yahoo?
>
> Basically, it's like replicating a file system: an array of folder
> names can point to the category_id to retrieve their children. However,
> it would take recursive SELECTs to find the category_id from the array
> of folder names.

So why don't you just use a simple adjacency list and a recursive query
instead ? What database are you using ?

asdf

unread,

Nov 9, 2005, 4:24:16 PM11/9/05

to

I use MySQL. The inefficiency for the adjacency list method grows
exponentially with the amount of categories.

vc

unread,

Nov 9, 2005, 4:42:47 PM11/9/05

to

asdf wrote:
> I use MySQL. The inefficiency for the adjacency list method grows
> exponentially with the amount of categories.

It's unclear how one can achieve an exponential performance degradation
with an adjacency list traversal. Could you post the piece of code you
use to do that ?

There are quite a few tree traversal implementations for MS SQL Server
2000 on the Internet. You can google for "SQL Server" and "tree
traversal".

amado...@netcabo.pt

unread,

Nov 9, 2005, 5:24:03 PM11/9/05

to

Mikito, the domains are not poorly understood, the join operation per
se does not give you a graph, the underlying theories are very nice
thank you, see e.g. RDF, XML, XQuery and XPath Data Model, and the
non-relational algebras XQuery Algebra, TAX, RAL etc.

amado...@netcabo.pt

unread,

Nov 9, 2005, 5:54:49 PM11/9/05

to

Semantic domain is as you know the fancy name for the "things" you
represent with your schema and data.

Relational *key* people (sorry, "star" was bad choice of word, unless I
cited Kleene too:-) you know and cited two, Codd, Chen, and then
there's Date, Ulman, Wiederhold, etc. and I would not mind citing here
the people siting at the W3C working groups for XQuery, XPath, etc.

E-R maps better to graph-based representations than to relational ones.
It is not a matter of "can" but of "how well."

No I won't mention UML nor say the dirty word ("semi-structured":-)

There's lots of (good) graph-based models. RDF is (very) different from
XML and both are (very) different from Codasyl and my baby called
Mneson is (very, not so much, less) different from Codasyl, XML, RDF.
(I suggest Mneson be discuss in the thread "Implementing a graph
algebra" because that's it)

The "network" model in particular, or Codasyl: nodes are tables with
records and attributes. Not a pure graph. Actually not very graph-like
at all. I think the technical term is "a dirty mess."

How is RDF or XML different from Codasyl in particular: a lot
different. An RDF node is a single concept, not a table like in
Codasyl. An RDF edge is a single property out of a possibly (depending
on the meta-schema) open set of properties, not the single relation
"child" of Codasyl. Etc. RDF is a much cleaner model. A true graph
model. This has good consequences at many levels (theory, practice,
adoption). For example you can find very nice algebras for RDF out
there.

Mikito Harakiri

unread,

Nov 9, 2005, 7:00:22 PM11/9/05

to

I assume that you asked how to query immediate descendants (children!)
in the nested intervals model. I've emailed the chapter snippet on
nested interval encoding to your yahoo email address. Look up page 15
where it develops that nested intervals matrix encoding is an adjacency
model.

mic...@preece.net

unread,

Nov 9, 2005, 7:30:04 PM11/9/05

to

I can't understand why no one has suggested Pick for this. Seems
perfectly simple to me. Anyone who has used Pick would not think twice
about this. The solution to the OP's problem would be trivial in the
extreme.

Mike.

-CELKO-

unread,

Nov 9, 2005, 8:19:10 PM11/9/05

to

>> Nested sets is often quoted as a solution for static hierarchies. It is rarely mentioned that one can't query ancestors path efficiently. <<

An employee and all their Supervisors, no matter how deep the
organizational tree, with a column for sorting the path from root to a
node:

SELECT O2.emp, (O2.rgt- O2.lft) AS path_order
FROM OrgChart AS O1, OrgChart AS O2
WHERE O1.lft BETWEEN O2.lft AND O2.rgt
AND O1.emp = @myemployee
ORDER BY path_order DESC;

VC

unread,

Nov 9, 2005, 8:22:17 PM11/9/05

to

<amado...@netcabo.pt> wrote in message
news:1131576889.9...@z14g2000cwz.googlegroups.com...

> Semantic domain is as you know the fancy name for the "things" you
> represent with your schema and data.

So what's the difference between a domain and a semantic domain ? If there
is none, why muddy waters with the fancy names like "semantic" ?

>
> Relational *key* people (sorry, "star" was bad choice of word, unless I
> cited Kleene too:-) you know and cited two, Codd, Chen, and then
> there's Date, Ulman, Wiederhold, etc. and I would not mind citing here
> the people siting at the W3C working groups for XQuery, XPath, etc.
>
> E-R maps better to graph-based representations than to relational ones.
> It is not a matter of "can" but of "how well."
>
> No I won't mention UML nor say the dirty word ("semi-structured":-)
>
> There's lots of (good) graph-based models. RDF is (very) different from
> XML and both are (very) different from Codasyl and my baby called
> Mneson is (very, not so much, less) different from Codasyl, XML, RDF.
> (I suggest Mneson be discuss in the thread "Implementing a graph
> algebra" because that's it)

Could you pleas give a definition of the graph algebra ?

>
> The "network" model in particular, or Codasyl: nodes are tables with
> records and attributes. Not a pure graph. Actually not very graph-like
> at all. I think the technical term is "a dirty mess."
>
> How is RDF or XML different from Codasyl in particular: a lot
> different. An RDF node is a single concept, not a table like in
> Codasyl. An RDF edge is a single property out of a possibly (depending
> on the meta-schema) open set of properties, not the single relation
> "child" of Codasyl. Etc. RDF is a much cleaner model. A true graph
> model. This has good consequences at many levels (theory, practice,
> adoption). For example you can find very nice algebras for RDF out
> there.

In order to be more specific, would you please elaborate on the RDF algebra.
like, give a definition of the algebra ?

>

Vadim Tropashko

unread,

Nov 9, 2005, 8:56:30 PM11/9/05

to

Joe,

We have already discussed this. The predicate

O1.lft BETWEEN O2.lft AND O2.rgt

looks simple, but is assymmetric regarding O1 and O2. In the acces path
driven from O2 to O1, the O1 set could be found by index range scan.
Find the first node in the b-tree satisfying

O1.lft BETWEEN O2.lft AND O2.rgt

then go to the next b-tree leaf, and so on. This is very efficient.

In contrast, the access path driven from O1 to O2 you have either to
scan the whole table O2, or use index range scan that would scan an
interval with one boundary condition only

O1.lft >= O2.lft

or

O1.lft <= O2.rgt

which would scan large portion of the index. This is (relatively)
inefficient.

Mikito Harakiri

unread,

Nov 9, 2005, 9:11:48 PM11/9/05

to

amado...@netcabo.pt wrote:
> Mikito, ..., the join operation per
> se does not give you a graph, ...

Take a set of all tables as a set of vertices. Take a set of all
foreign key integrity constraints as a set of edges. Do you challenge
the idea that I just defined a graph?

Alexandr Savinov

unread,

Nov 10, 2005, 4:14:48 AM11/10/05

to

Mikito Harakiri schrieb:

Yes, but in such a formulation it is not RM! It is a graph model
implemented via RM.

Nobody argues that many things can be implemented via RM including
graph-based models. I could even imagine that someone could implement
OOP or AOP techniques using RM. Or, vice versa, if I implement a
behaviour of RM using Turing machine then it does not mean that Turing
has invented RM.

The question is (as far as I understand it):

Is it more appropriate to view the (data) world as a set of relations
(with some operations, constraints etc.) or as a graph?

--
http://conceptoriented.com

amado...@netcabo.pt

unread,

Nov 10, 2005, 6:57:07 AM11/10/05

to

<<In order to be more specific, would you please elaborate on the RDF
algebra.
like, give a definition of the algebra ?>>

An RDF algebra is given in "RAL: an Algebra for Querying RDF" by
Fransicar et al. There are others. I remember seing very nice
discussions in the proceedings of the WWW conference some years ago.

For XML see: "TAX: A Tree Algebra for XML" by Jagadish et al. and
"XQuery 1.0 and XPath 2.0 Formal Semantics" by the W3C, which
supersedes "The XML Query Algebra."

For my own efforts see thread "Implementing a graph algebra."

Please note that I do not particulary favour RDF. I am interested in
more general languages, ideally one capable of solving the wild
disparity of Semantic Web languages found today. Maybe a bit like what
Codd did in the 1970's w.r.t. the disparity of database models.

vc

unread,

Nov 10, 2005, 8:56:38 AM11/10/05

to

amado.al...@netcabo.pt wrote:
> <<In order to be more specific, would you please elaborate on the RDF
> algebra.
> like, give a definition of the algebra ?>>
>
> An RDF algebra is given in "RAL: an Algebra for Querying RDF" by
> Fransicar et al. There are others. I remember seing very nice
> discussions in the proceedings of the WWW conference some years ago.
>

No, that won't do. Please give your own RDF algebra definition, not
necessarily formal. Surely it cannot be that complicated.

> For XML see: "TAX: A Tree Algebra for XML" by Jagadish et al. and
> "XQuery 1.0 and XPath 2.0 Formal Semantics" by the W3C, which
> supersedes "The XML Query Algebra."
>
> For my own efforts see thread "Implementing a graph algebra."

You did not provide any definition of the algebra. Could you please do
it now ?

>
> Please note that I do not particulary favour RDF. I am interested in
> more general languages, ideally one capable of solving the wild
> disparity of Semantic Web languages found today.

I do not understand what you mean.

> Maybe a bit like what
> Codd did in the 1970's w.r.t. the disparity of database models.

Also, you did not clarify what you've meant by a "semantic domain" as
opposed to just a "domain". Could you offer a clarification ?

Mikito Harakiri

unread,

Nov 10, 2005, 11:56:31 AM11/10/05

to

Alexandr Savinov wrote:
> Mikito Harakiri schrieb:
> > amado...@netcabo.pt wrote:
> >
> > Take a set of all tables as a set of vertices. Take a set of all
> > foreign key integrity constraints as a set of edges. Do you challenge
> > the idea that I just defined a graph?
>
> Yes, but in such a formulation it is not RM! It is a graph model
> implemented via RM.

I was merely suggesting that there is no more substance to the
Entity-Relationship model than in the above graph definition.

amado...@netcabo.pt

unread,

Nov 10, 2005, 12:59:50 PM11/10/05

to

I'll not give an algebra for RDF here since RAL is readily available
and I have my own algebra to worry about. So I'll give you a sketch of
the latter instead.

Mneson is a graph-based metamodel of data, based on a graph with the
following properties:

1. Connections are binary, directed, and untyped i.e. unlabelled. Note
that being untyped entails not having an explicit identifier. A
connection is identified solely by the (ordered) pair of vertices it
represents. Hence, there are no duplicate connections. That is, for any
pair of vertices (x, y) there can be at most one connection from x to
y. For a connection from x to y we say that x is a "source" of y and y
is a "target" of x.

2. A vertex is either a datum or a pivot.

(a) A data vertex or datum represents an atomic value of data e.g. an
integer, a real number, a symbol. The value of a data vertex is the
identity of the vertex.

(b) A pivot has no intrinsic value but it has nevertheless a unique
identity. In practice a serial number is used to identify pivot
vertices.

3. There are no island vertices. There may be, however, island
subgraphs.

A database instance is expressed in the base structure by means of
special subgraph topologies that represent complex data entities like
attributes, lists, sets, etc. [1] A possible definition is as follows:

Attribute. An attribute of name x1 and value x2 is represented by an
attribute instance x3 such that x1 connects to x3 and x3 connects to
x2. The attribute is connected to its subject x0 by connecting x0 to
x3. For example suppose that x5 represents an employee; the fact that
x5 has Social Security Number, or SSN (the attribute name) 123456789
(the attribute value) is expressed by connecting: x5 to x6, "SSN" to
x6, x6 to 123456789.

Set. For a vertex x that represents a set, the targets of x represent
the members of the set. In parlance we say x is the 'head' of the set.
Note this is a different object than the algebraic type set of vertices
below.

Etc.

The topologies may overlap, e.g. a vertex x that is an attribute type
may also be the member of a set, or even a set itself in which case the
respective attribute instance is a member of x.

An algebra (called the Mneson Calculus) is defined on the base graph as
follows. The algebraic type is the set of vertices. The operations
include the set operations (intersection, union, etc.) and the (unary)
adjacency operators T, S for Targets, Sources, with the obvious
semantics:

T (X) = {y : y is a target of x, x in X}
S (X) = {y : y is a source of x, x in X}

The Mneson Calculus serves as a query language. For example, the
following query retrieves the employee with SSN 123456789:

T ({Employees}) ^ S [T ({"SSN"}) ^ S({123456789})]

where ^ means set intersection.

Singleton arguments are very common so we take the liberty of ommiting
the {}, for clarity; and we want to focus on algebraic issues so often
we also ommit the "" around literals. With these simplifications the
expression above becomes

T (Employees) ^ S [T (SSN) ^ S (123456789)]

This example of course assumes a schema where vertex Employees is the
head of the set of employees.
________
[1] These entities are also called "semantic" structures because they
are designed to model semantic domains. Note how the semantic
structures are in turn modelled in terms of the base untyped graph. So
formally we have a chain of domains here and we say "semantic domain"
do describe the former level. Sometimes I say "semantic domain" even
when it is not necessary to distinguish, for habit I guess, I am sorry
if I sound pedantic, please just ignore the word on those occasions.

vc

unread,

Nov 10, 2005, 3:30:57 PM11/10/05

to

Do you mean to say that the entity is presented as a graph whose nodes
are values of some type and edges are directed paths with two paths (S
and T) per each node ? If so, how is it different or better that the
traditional network model ?

amado...@netcabo.pt

unread,

Nov 10, 2005, 5:07:53 PM11/10/05

to

"Do you mean to say that the entity is presented as a graph whose nodes

are values of some type and edges are directed paths with two paths (S
and T) per each node?"

No. A node is either a datum or a pivot. An edge is directed. S, T
means the sources, targets of a node, if any. A node may have any
number of M sources, any number N of targets, as long as M + N > 0.

For a connection from x to y we say x is a source of y, y is a target
of x.

S, T are also called A+, A- in the literature (A from "Adjacent").

The "traditional" network model being Codasyl, nodes are tables, and
edges have a predefined semantic value, member I think, I don't have
the reference at hand, anyway very different (and anyway I'm even less
interested in Codasyl than in RDF).

VC

unread,

Nov 10, 2005, 5:25:31 PM11/10/05

to

<amado...@netcabo.pt> wrote in message
news:1131660473.5...@g43g2000cwa.googlegroups.com...

> "Do you mean to say that the entity is presented as a graph whose nodes
>
> are values of some type and edges are directed paths with two paths (S
> and T) per each node?"
>
> No. A node is either a datum or a pivot. An edge is directed. S, T
> means the sources, targets of a node, if any. A node may have any
> number of M sources, any number N of targets, as long as M + N > 0.

So how would you represent , say, three entities like 'employee',
'department', 'company' ?

amado...@netcabo.pt

unread,

Nov 10, 2005, 7:07:56 PM11/10/05

to

Each single concept is represented by a vertex. The properties of the
concept are represented by topologies involving the vertex. The fact
that employee x10 works in the accounting department may be modelled as

Employees => x10
x10 => x11
WorksIn => x11
x11 => Accounting

where x => y means connected from x to y. Note that this model makes
similarly possible and simple to query for where does x10 work

T (T (x10) ^ T (WorksIn))

and for all workers in accounting

T (Employees) ^ S [S (Accounting) ^ T (WorksIn)]

(A more demanding schema would require a department be also a pivot, so
it could have attributes attached, and be a member of a Departments
like employee x10 is a member of Employees.)

Mikito Harakiri

unread,

Nov 10, 2005, 7:39:36 PM11/10/05

to

amado...@netcabo.pt wrote:
> Each single concept is represented by a vertex. The properties of the
> concept are represented by topologies involving the vertex. The fact
> that employee x10 works in the accounting department may be modelled as
>
> Employees => x10
> x10 => x11
> WorksIn => x11
> x11 => Accounting
>
> where x => y means connected from x to y.

Before going to queries, could you please work out, that is demontrate
what the graph is for the following two relations

Employees
Name Dept Salary
------ ------ -------
Smith 1 1000
Scott 2 2000
King 2 3000

Departments
Deptno Name
------ ------
1 Accounting
2 Manufacturing

amado...@netcabo.pt

unread,

Nov 10, 2005, 8:10:19 PM11/10/05

to

In the Dot language (Graphviz from AT&T):

digraph {
Employees -> x1
Employees -> x2
Employees -> x3
Name -> x4
Name -> x5
Name -> x6
Dept -> x7
Dept -> x8
Dept -> x9
Salary -> x10
Salary -> x11
Salary -> x12
Departments -> x13
Departments -> x14
Deptno -> x15
Deptno -> x16
Name -> x17
Name -> x18
x1 -> x4 -> Smith
x1 -> x7 -> x13
x1 -> x10 -> 1000
x2 -> x5 -> Scott
x2 -> x8 -> x14
x2 -> x11 -> 2000
x3 -> x6 -> King
x3 -> x9 -> x14
x3 -> x12 -> 3000
x13 -> x15 -> 1
x13 -> x17 -> Accounting
x14 -> x16 -> 2
x14 -> x18 -> Manufacturing
}

Note this is a canonical, literal translation of the given relational
database. Modelling the semantic domain directly on Mneson would
simplify things. For example the Deptno is probably not necessary
semantically but just a requirement of the relational model for a key,
or for the connection (join), or both. Mneson does not need keys. A
distinct entity is a distinct vertex. Connected entities simply share a
topology.

VC

unread,

Nov 10, 2005, 8:21:21 PM11/10/05

to

<amado...@netcabo.pt> wrote in message
news:1131667676....@g47g2000cwa.googlegroups.com...

> Each single concept is represented by a vertex. The properties of the
> concept are represented by topologies involving the vertex. The fact
> that employee x10 works in the accounting department may be modelled as
>
> Employees => x10
> x10 => x11
> WorksIn => x11
> x11 => Accounting
>
> where x => y means connected from x to y. Note that this model makes
> similarly possible and simple to query for where does x10 work
>
> T (T (x10) ^ T (WorksIn))
>
> and for all workers in accounting
>
> T (Employees) ^ S [S (Accounting) ^ T (WorksIn)]

Ok, I see. Now, please explain the following:

1. How is the graph model navigation different from the old Codasyl network
model pointer chasing, as what you apparently do is:

a. go one step down from the Employee node and collect all its children
b. go one step up from the Accounting node and collect all its "parents"
c. go one step down from WorkIn and collect all its children
d. find the intersection of sets from step (b) and (c)
e. go one step up for each element in the set constructed at step (d)
f. find the intersection of sets constructed at step (a) and (e)

The only novelty in comparison to the old network model appears to be the
ability to use sets of nodes and set operations on them (which is not quite
a novelty since as far as I remember OMG's OQL could do the same).

2. The structure you've described is not an algebra because it's not closed
over the operations you've defined (compare to the relational algebra).

3. It's unclear how you handle attributes like age or salary (find all the
employees in accounts whose age is between 30 and 40 and salary is less than
40,000).

Mikito Harakiri

unread,

Nov 10, 2005, 8:42:38 PM11/10/05

to

I'm sorry, I started to draw the graph until I saw

x1 -> x4 -> Smith

What is this -- an abbreviation for 2 adjacent edges?
x1 -> x4
x4 -> Smith
Or some kind of labeled edge?

> Note this is a canonical, literal translation of the given relational
> database. Modelling the semantic domain directly on Mneson would
> simplify things. For example the Deptno is probably not necessary
> semantically but just a requirement of the relational model for a key,
> or for the connection (join), or both. Mneson does not need keys. A
> distinct entity is a distinct vertex. Connected entities simply share a
> topology.

I agree, you can remove the Deptno -- I will be able to understand the
model without it.

amado...@netcabo.pt

unread,

Nov 10, 2005, 9:09:52 PM11/10/05

to

x -> y -> z

is an abbreviation for

x -> y
y -> z

dawn

unread,

Nov 10, 2005, 9:18:02 PM11/10/05

to

Of course you are right, Mike. smiles. --dawn

amado...@netcabo.pt

unread,

Nov 10, 2005, 10:02:24 PM11/10/05

to

Mneson is not Codasyl nor OQL based on the descriptions I have of
those.

I believe the Mneson Calculus is an algebra. I'm still working on it.
Example laws:

T ({x1, x2, ...}) = T({x1}) U T({x2}) U ...
T ({}) = {}
T (XUY) = T(X) U T(Y)
T (X^Y) = T(X) ^ T(Y)
Etc.

Conditions on values are made possible by introducing value ranges, set
of values expressed in a compact way, typically as an interval or a
bound. Querying for employees earning less than 40,000:

T (Employees) ^ S (S (T (T (Salary)) ^ RangeForLessThan (40'000)))

Of course query optimization would kick in here, and that's the main
purpose of having the algebra.

Mikito Harakiri

unread,

Nov 10, 2005, 10:47:44 PM11/10/05

to

Ok, so literal translation of 2 minuscule relations 3x3 and 2x2 gives a
graph with 44 edges. Not bad for a start. And network/hierarchy people
are complining that modelling a graph in relational model with a puny

table Edges (
fromNode integer,
toNode integer
)

is unwieldy!

Could you please provide a more compact graph (by Nelson, Carson or
whatever other semantic metadology) which is perhaps a little bit more
impressive?

VC

unread,

Nov 10, 2005, 11:13:20 PM11/10/05

to

<amado...@netcabo.pt> wrote in message
news:1131674470....@g43g2000cwa.googlegroups.com...

> Mneson is not Codasyl nor OQL based on the descriptions I have of
> those.

I asked to explain in what specific ways the graph structure is different
from the network model. So far I do not see any.

>
> I believe the Mneson Calculus is an algebra.
>I'm still working on it.
> Example laws:
>
> T ({x1, x2, ...}) = T({x1}) U T({x2}) U ...
> T ({}) = {}
> T (XUY) = T(X) U T(Y)
> T (X^Y) = T(X) ^ T(Y)
> Etc.

Let's assume your algebraic set consists of vertices. The T/S operators
create sets of vertices that are not elements of the algebraic set (or if
you say they are, the intersection or union operator clearly cannot be
applied to vertices). It's obvious that the graph structure is not an
algebra (cf. the relational algebra).

>
> Conditions on values are made possible by introducing value ranges, set
> of values expressed in a compact way, typically as an interval or a
> bound. Querying for employees earning less than 40,000:
>
> T (Employees) ^ S (S (T (T (Salary)) ^ RangeForLessThan (40'000)))
>

So instead of using a trivial condition like 'salary < 40,000', would one
need to add a special entity (node) 'RangeForLessThan(constant) to the data
model for every imaginable constant ? I do not think it can be called an
improvement over what we have now. If you think it can, please show how it
can be an improvement.

Alexandr Savinov

unread,

Nov 11, 2005, 5:35:22 AM11/11/05

to

amado...@netcabo.pt schrieb:

Assume we have an arbitrary graph. Are there any rules for determining
the role of its nodes such as collection (Employess, Departments),
instance ('John Smith', 'Accounting Department'), properties (age,
salary) etc?

Can these roles be mixed, for example, a node could play a role of
property with respect to one subgraph and an entity for another subgraph?

Is it true that this database engine does not know anything about the
roles of nodes and manipulates only the graph using some formalism?

Is there any informal interpretation of x -> y connection which would
allow me to build a graph given some problem domain description?

--
http://conceptoriented.com

amado...@netcabo.pt

unread,

Nov 11, 2005, 5:54:22 AM11/11/05

to

"...a special entity (node) 'RangeForLessThan(constant) to the data

model for every imaginable constant ?"

Of course not. Even RangeForLessThan is clearly to see that it is an
function.

Note that actually using ranges instead of boolean conditions keeps the
language simpler, because you don't introduce boolean conditions. A
range function is an algebraic operator like T, S.

You you're just trying to find problems in my system let me save you
some time: Mneson Calculus expressions for complex queries are hard to
follow and construct; a better query language with proper linguistic
devices for the semantic structures must be selected or created; then
mapped onto Mneson Calculus for query planning, optimization,
evaluation; here operations must be optimized; I have suceeded in
optimizing expressions like T(K1)^T(K2) where K1, K2 are known (e.g.
constants) but I am still looking for ways to optimize expressions like
T(T(...))

amado...@netcabo.pt

unread,

Nov 11, 2005, 6:06:18 AM11/11/05

to

"Let's assume your algebraic set consists of vertices."

Let's not. The algebraic *type* is the set of vertices. That means the
algebraic set is the set of sets of vertices. The notation T(k) where k
is a constant is just an abreviation for T({k}). I've said all this
before.

/*
Maybe the abbreviation T{k} would be better. Actually I have
considered: T<string> for T({"string"}), T[number] for T({number}).
*/

amado...@netcabo.pt

unread,

Nov 11, 2005, 6:16:06 AM11/11/05

to

"I asked to explain in what specific ways the graph structure is
different
from the network model."

If what you denote by "network model" is some system equal or very
similar to mine then please point to or give a description of it with
the same rigour as I've just done for mine. I'd be very interested in
that system.

vc

unread,

Nov 11, 2005, 8:46:40 AM11/11/05

to

amado...@netcabo.pt wrote:
> "Let's assume your algebraic set consists of vertices."
>
> Let's not. The algebraic *type* is the set of vertices. That means the
> algebraic set is the set of sets of vertices. The notation T(k) where k
> is a constant is just an abreviation for T({k}). I've said all this

Ok, so you have sort of graph vertices algebra if you present each
node as a one element set. However, it does not qualify for being
called a graph algebra since you have no operations defined for graphs
themselves, you have them only for vertices.

To clarify, your database schema can be seen as one large graph.
Clearly, the real world entity is modelled as a subgraph. So what
operations do you define to manipulate/query such subgraphs ? How does
one construct a new entity (a subgraph in your model) as easily as can
be done in the RA with its joins/projections/etc. ?

An analogy might help. Let's say you have a model with the classical
database relations but no relational operators like joins and such
defined for the relations. Can you seriously claim that you have a
sort of algebra just by virtue of the relation attributes belonging to
the domains with closed operations (like set of integers closed under
multiplication/addition) ?

I am not trying find faults in your system, but what I see is a graph
model virtually indistinguishable from the Codasyl network model with
all the usual problems, like application dependency, navigational
quering, lack of query algebra, etc. (as are XML "model", RDF and other
proposals of the kind).

amado...@netcabo.pt

unread,

Nov 11, 2005, 9:21:42 AM11/11/05

to

The Mneson Calculus is just a query language. At least as presented so
far. Note these things are a bit in flux. To manipulate the graph the
primitives are Connect (x, y), Disconnect (x, y) with the obvious
semantics. With these primitives any graph may be constructed. With the
Mneson Calculus any selection may be retrieved. Together with the
topologies for semantic structures these form the underpinnings of
Mneson. A host of languages can be created upon them. My personal
prototype of Mneson is an Ada library. Then I use Ada normally, to glue
querying and construction when needed, and make applications.

<<... what I see is a graph

model virtually indistinguishable from the Codasyl network model with
all the usual problems, like application dependency, navigational
quering, lack of query algebra, etc. (as are XML "model", RDF and other

proposals of the kind). >>

Then you see problems that do not exist, sorry. Clearly you are still
very under the Myth.

amado...@netcabo.pt

unread,

Nov 12, 2005, 8:54:21 AM11/12/05

to

> Could you please provide a more compact graph (by Nelson, Carson or
> whatever other semantic metadology) which is perhaps a little bit more
> impressive?

Sorry for the late response.

Yes, it is essential to provide more compact representations, for
semantic structures, than the Mneson Calculus or the base graph, for
cases where the MC or the base graph inflates the number of calculus or
graph elements. One such case is the attribute structure, so I'll
discuss this case briefly. Note it equates a labeled edge.

attr. name
subject -------------> value

Replacing the attribute topologies in the base graph for labelled edges
should reduce the edge count by 2 or 3 times. Try it.

Also textual languages may be created that express the semantic
structures directly e.g.

attribute (subject, attr_name, value);

which can be defined with the primitives

new_vertex (x);
connect (subject, x);
connect (attr_name, x);
connect (x, value);

The base graph serves to represent all different semantic structures in
the same formalism, and the Mneson Calculus (operating on the base
graph) is supposed to serve as the query plan and optimization
component. As I said in another message, also higher level query
languages should exist that map into the Mneson Calculus e.g. the very
common query pattern

members of set X having attribute Y with value Z

of which we have seen an instance before

T(Employees) ^ S[T(SSN) ^ S(123456789)]

could be a 'macro' or some such, with translation to the Mneson
Calculus

TX ^ S(TY ^ SZ)

Remember ^ means intersection.

In meta-expressions (with variables only) I drop the functional
parenthesis.

BTW a possible optimization of the above involves the derivation

TX ^ S(TY ^ SZ)
= TX ^ STY ^ SSZ
= T{x1} ^ ... ^ T{xm} ^ ST{y1} ^ ... ^ ST{yn} ^ SS{z1} ^ ... ^ SS{zl}

(where X = {x1, ..., xm}, etc.) because there are efficient methods to
compute the n-ary intersection of A{x} terms, where A is T or S; I'm
studying their extension to include AA{x} terms, see the thread
"Implementing a graph algebra"

I often use the Mneson Calculus and the base graph directly when
modelling examples for a number of reasons:

- some very common semantic structures e.g. set map directly to a base
structure (untyped link)
- it allows to mix structures
- I don't have a higher level device of choice yet
- I'm familiar with the MC and the base structure and the topologies
- I'm studying the MC as a query planning and optimization device

amado...@netcabo.pt

unread,

Nov 12, 2005, 1:30:17 PM11/12/05

to

"Are there any rules for determining
the role of its nodes such as collection (Employess, Departments),
instance ('John Smith', 'Accounting Department'), properties (age,
salary) etc?"

A branch of the graph could contain the shema. I also favour shemaless
idoms a la XML. In this case canonical topologies are used. And a
canonical semantic model with a focus on sets and attributes. This is a
matter of convention I guess. Often times I design a Mneson database
instance with a Top vertex leading to all concepts and sets as named
attributes of Top. So a new user inspecting the attribute names of Top
would get the schema. Often times an attribute of Top named "types" is
the set of all attribute types (=attribute names) in the database. This
is the catalog problem occuring also in relational systems.

"Can these roles be mixed, for example, a node could play a role of
property with respect to one subgraph and an entity for another
subgraph?"

Absolutely and this is I believe one of the strenghs of the model.

"Is it true that this database engine does not know anything about the
roles of nodes and manipulates only the graph using some formalism?"

The system knows about the canonical structures so it can process
canonical graphs in a clever way e.g. by displaying attributes as
labelled edges.

"Is there any informal interpretation of x -> y connection which would
allow me to build a graph given some problem domain description?"

The canonical interpretations of x -> y are

(1) if x is the head of a set, y is a member

(2) y is a property of x

(3) the case [attribute type (or name) -> attribute instance] it is
just a convention, however one could say an attribute instance is a
member of the attribute type, seing the attribute type as a set, so
this case would be derived from (1)

Note how (1-2) combine neatly to form a record = set of attributes.

(Excellent questions you all, thank you, sorry for the late responses,
and please if I fail to respond drop me a note directly, it might be
because I failed to detect the question as I'm just browsing in Google
Groops.)

Mikito Harakiri

unread,

Nov 12, 2005, 4:40:00 PM11/12/05

to

amado...@netcabo.pt wrote:
> Yes, it is essential to provide more compact representations, for
> semantic structures, than the Mneson Calculus or the base graph, for
> cases where the MC or the base graph inflates the number of calculus or
> graph elements. One such case is the attribute structure, so I'll
> discuss this case briefly. Note it equates a labeled edge.
>
> attr. name
> subject -------------> value
>
> Replacing the attribute topologies in the base graph for labelled edges
> should reduce the edge count by 2 or 3 times. Try it.

A labeled edge

attr. name
subject -------------> value

give rise to ambiguity

subject
attr.name -------------> value

(and all the other possible permutations of the 3 elements).

Well, actually I found some examples on your slides
http://www.liacc.up.pt/~maa/mneson/unh.pdf
These graphs seems to be subgraphs of lattice structures from formal
concept analysis.
http://www.upriss.org.uk/papers/arist.pdf
Note that objects are positions in the bottom of the graph on figure 2
as the lattice atomic elements. The attributes are on the top. You have
attributes on the left, and objects on the right.

Nobody said the other models don't have the right to exist. Hey, there
is even connection of formal concept analysis to relational algebra. It
is simply that the alternative models are less mature and never enjoyed
the same level of success as RM.

asdf

unread,

Nov 12, 2005, 5:04:43 PM11/12/05

to

Inserting nodes in: http://arxiv.org/html/cs.DB/0401014 without
matrices don't work. How do I insert a node in a parent node without
matrices?

asdf

unread,

Nov 12, 2005, 5:07:06 PM11/12/05

to

the inserting examples in http://arxiv.org/html/cs.DB/0401014 don't work

Vadim Tropashko

unread,

Nov 12, 2005, 5:49:34 PM11/12/05

to

asdf wrote:
> the inserting examples in http://arxiv.org/html/cs.DB/0401014 don't work

What exactly doesn't work?

BTW, did you get the book chapter? It contains everything: insertion,
queries, translation to and from materialized path encoding, and most
important of all, the encoding is tricked to be adjacency relation as
well.

amado...@netcabo.pt

unread,

Nov 12, 2005, 10:21:29 PM11/12/05

to

<<A labeled edge

attr. name
subject -------------> value

give rise to ambiguity

subject
attr.name -------------> value >>

Indeed. See the given interpretations of the untyped connection,
specially (3).

Thanks a lot for the Formal Concept Analysis reference. I'll surely
take a look.

"Nobody said the other models don't have the right to exist. Hey, there

is even connection of formal concept analysis to relational algebra."

Mitiko et al., I'm thrilled that you took the time to review these sort
of non-relational ideas. Thanks a lot. Every two years or so I babble
here and there some non-relational thoughts, and I think it's the first
time they've met with some rational appreciation. Maybe it's a sign of
times, namely the spectre of "semi-structured data" (oops, I've said
it) upon us, or my system got better, or both. Anyway thanks a lot, and
I totally retract what I said about you being under the Myth.

"It is simply that the alternative models are less mature and never
enjoyed
the same level of success as RM."

And 30 years from now someone will be saying that substituting MC for
RM :-)

asdf

unread,

Nov 13, 2005, 5:37:12 PM11/13/05

to

Yes, but my current infrastructure is that the fractions are stored in
the same table as the category id. The book mentioned that I have to
store the fractions on a seperate table. I use example 2 of the
insertion example in http://arxiv.org/html/cs.DB/0401014 but it didn't
work. How do I add a child node to the parent node correctory? How do I
query the immediate children and insert a node after the oldest child
with the fractions stored in the same rows? I wrote code that converts
materialized path to fractions and vice versa, but how do I use it to
insert new nodes?

Thanks

Vadim Tropashko

unread,

Nov 14, 2005, 1:52:56 PM11/14/05

to

Let's get back to the basics of an RDBMS. Your questions are all based
on the wrong mental model. Rows are not records; fields are not
columns; tables are not files. Edges are Not Nodes/Vertices! Just
kidding. I indeed wrote that edges and nodes should be kept in separate
tables. However, this has been written in chapter on graphs, not trees
(the next chapter). If this were written for trees, please let me know
the page number, as I have to correct it. The idea of tree encoding is
that we add one or several more columns to the table of *nodes*.

OK, going back to farey fractions article. When saying that something
is not working please be more specific, and tell what query/dml failed,
and, more important, on what set of data. For example, make a tree with
3 nodes, and show that insertion of the 4th node fails.

There are 3 improvements that make migrating to the matrix encoding
attractive:
1. Faster algorithm for finding next fraction.
2. An algorithm for moving subtrees.
3. Finding children via adjacency relation.
I don't quite understand your reasons why you don't want to use matrix
encoding. (Although, some developers are quick to dismiss the entire
nested intervals because it allegedly requires to use stored
procedures.)

asdf

unread,

Nov 15, 2005, 7:50:34 PM11/15/05

to

Since I like to use an array of folders, I would prefer to use the
adjacency list. I just found out that getting an ID out of an array of
folders in the URL is extremely efficient with the adjacency list, the
same efficency as nested intervals. Also, adjacency list is extremely
fast for moving categories. Adjacency list gets the same speed of
nested intervals if I use an array of folders in the URL.

If I use an ID in the URL instead of an array of folders, it would turn
out very inefficient because it has to use multiple joins to find the
path. However, when I use an array of folders in the URL, the path is
already in the URL, so I don't have to find the path again. That method
is very efficient.

Counting all of the listings recursively in a subcat is anyway very
inefficient for all of the methods so I would indeed use an adjacency
set.How would I count all of the listings recursively in a subcat
efficently? Should I store the number of listings below every cat so I
don't have to recursively use the count function? When I was in the
'Arts' category on dmoz.org, and I refreshed that category a couple of
times, on every page load, it showed the numbers of listings
differently every time in some subcats in the 'Arts' category. Do I
have to add a row that stores how many listings are in each category,
or are there any other efficient ways to do this? Because when every
time I insert a new listing (web page listing, not a node in a tree) in
a category, all of the ancestors listings_count row have to be updated.
That would be inefficient because the higher the category, the more
often listings_count row have to be updated, and it would slow down the
directory sufficiently because the hard disk have to rotate to the
category_count row frequently so it would slow down other stuff (like
selecting categories) considerably. How does dmoz.org do it?

However, the nested interval encoding can fit into the adjacency-list
model, so I could done that with nested intervals with the same
efficency as adjacency-list. Are there any special advantages for the
nested intervals model over the extremely efficent "adjacency-list
model with the array of folders in the URL"? If there aren't any
special advantages in the nested intervals model, then I would use the
adjacency list.

Anyway, I was very "stupid" to not think that getting an ID out of an
array of folders in the URL is very inefficient.

Vadim Tropashko

unread,

Nov 15, 2005, 10:02:22 PM11/15/05

to

asdf wrote:
> Since I like to use an array of folders, I would prefer to use the
> adjacency list. I just found out that getting an ID out of an array of
> folders in the URL is extremely efficient with the adjacency list, the
> same efficency as nested intervals. Also, adjacency list is extremely
> fast for moving categories. Adjacency list gets the same speed of
> nested intervals if I use an array of folders in the URL.

What is "array of folders"? Materialized path?
What is adjacency list? Celko calls adjacency relation as adjacency
list.

> If I use an ID in the URL instead of an array of folders, it would turn
> out very inefficient because it has to use multiple joins to find the
> path. However, when I use an array of folders in the URL, the path is
> already in the URL, so I don't have to find the path again. That method
> is very efficient.
>
> Counting all of the listings recursively in a subcat is anyway very
> inefficient

Counting the size of the subtree is efficient for Nested Sets only. It
is (relatively) inefficient for all the other models.

> for all of the methods so I would indeed use an adjacency
> set.How would I count all of the listings recursively in a subcat
> efficently? Should I store the number of listings below every cat so I
> don't have to recursively use the count function?

You can indeed store the size of a subtree in each node. The update
penalty is refreshing all the nodes on the path to the root (which is
small compared to the size of a tree).

> When I was in the
> 'Arts' category on dmoz.org, and I refreshed that category a couple of
> times, on every page load, it showed the numbers of listings
> differently every time in some subcats in the 'Arts' category. Do I
> have to add a row that stores how many listings are in each category,
> or are there any other efficient ways to do this?

Why do you need a row? I thought you need a column.

> Because when every
> time I insert a new listing (web page listing, not a node in a tree) in
> a category, all of the ancestors listings_count row have to be updated.
> That would be inefficient because the higher the category, the more
> often listings_count row have to be updated, and it would slow down the
> directory sufficiently because the hard disk have to rotate to the
> category_count row frequently so it would slow down other stuff (like
> selecting categories) considerably. How does dmoz.org do it?

If you ever find out, please post it here:-)

> However, the nested interval encoding can fit into the adjacency-list
> model, so I could done that with nested intervals with the same
> efficency as adjacency-list. Are there any special advantages for the
> nested intervals model over the extremely efficent "adjacency-list
> model with the array of folders in the URL"? If there aren't any
> special advantages in the nested intervals model, then I would use the
> adjacency list.

Well, I'm confused with your terminology. Here is a brief summary of
each model anyway:

1. Adjacency relation (tree edges; standalone, or combined with the
tree nodes):
- Have to use proprietory SQL extensions for finding ancestors and
descendants; although the queries are efficient
- Finding descendants is relatively efficient (i.e. proportional to the
size of the subtree)
- Finding node's children is trivial
- Finding node's parent is trivial
- Aggregate queries are relatively slow
- Tree reorg is very simple

2. Nested Sets
- Finding descendants is easy and relatively efficient (i.e.
proportional to the size of the subtree)
- Finding ancestors is easy but inefficient
- Finding node's children as all the descendants restricted to the next
level is inefficient (e.g. consider root node)
- Finding node's parent as ancestor on the previous level is
inefficient due to inefficiency of ancestors search
- Aggregate queries are relatively slow (except counting, which is
super fast)!
- Tree reorg is hard

3. Materialized Path
- Finding descendants is easy and relatively efficient (i.e.
proportional to the size of the subtree)
- Finding ancestors is tricky but efficient
- Finding node's children as descendants on next level is inefficient
- Finding node's parent as ancestor on the previous level is efficient
- All aggregate queries are relatively slow
- Tree reorg is easy

4. Nested Intervals
- Same as MP: Finding descendants is easy and and relatively efficient
(i.e. proportional to the size of the subtree)
- Same as MP: Finding ancestors is tricky but efficient
- Same as AR: Finding node's children is trivial
- Same as AR: Finding node's parent is trivial
- All aggregate queries are relatively slow
- Tree reorg is easy (but not as simple as in AR)

DonR

unread,

Nov 16, 2005, 1:17:56 PM11/16/05

to

Micheal and Dawn,
I see the folks here just ignored the Pick suggestion. This has always
been Pick's problem, the guru's think things have to be complicated to
work.
Of course, if they were using Pick, they wouldn't need to be on here
discussing meaningless theory.

I've worked with Pick for 20 years and haven't found a business problem
that Pick couldn't handle.
Regards,
Don

asdf

unread,

Nov 16, 2005, 3:33:56 PM11/16/05

to

The array of folders for:

http://dir.example.com/Games/Board_Games/Abstract/Chess/Software/Web_Chess_Viewers/

is:

folders[0] = 'Games'
folders[1] = 'Board Games'
folders[2] = 'Abstract'
folders[3] = 'Chess'
...

so the id can be queried very efficiently by first getting the ID for
the row name = 'Games' AND parent = 0, since the main category's id is
'0'. It then gets the id for the category called 'Games' and continue
on with name='Board Games' AND parent equals the id for the row named
'Board Games', then continues on until it ends on the last element of
the array. So this is very efficient. It has to do multiple SELECTs.
Are multiple SELECTs efficient?

Getting the path does not require multiple JOINs, just multiple
SELECTs, so the speed is equal to getting the id from the path shown
above.

So what are the advantages for Nested Intervals?
What are the differences in terms of speed between Nested Intervals and
LDAP?
Is this method more efficient than the speed of LDAP?

Vadim Tropashko

unread,

Nov 16, 2005, 4:38:11 PM11/16/05

to

asdf wrote:
> The array of folders for:
>
> http://dir.example.com/Games/Board_Games/Abstract/Chess/Software/Web_Chess_Viewers/
>
> is:
>
> folders[0] = 'Games'
> folders[1] = 'Board Games'
> folders[2] = 'Abstract'
> folders[3] = 'Chess'
> ...
>
> so the id can be queried very efficiently by first getting the ID for
> the row name = 'Games' AND parent = 0, since the main category's id is
> '0'. It then gets the id for the category called 'Games' and continue
> on with name='Board Games' AND parent equals the id for the row named
> 'Board Games', then continues on until it ends on the last element of
> the array. So this is very efficient.

I have difficulty understanding without the schema. Could you please
describe what columns do you store in your table (and if there is more
than one table, describe them as well).

> It has to do multiple SELECTs.
> Are multiple SELECTs efficient?

Multiple selects can be efficient if they are not inside loops, or
recursion. In word, how many of them your application would typically
fire?

> Getting the path does not require multiple JOINs, just multiple
> SELECTs, so the speed is equal to getting the id from the path shown
> above.
>
> So what are the advantages for Nested Intervals?
> What are the differences in terms of speed between Nested Intervals and
> LDAP?
> Is this method more efficient than the speed of LDAP?

LDAP has nothing to do with SQL, or Relational Databases for that
matter.

asdf

unread,

Nov 16, 2005, 4:57:33 PM11/16/05

to

I use this method:

CREATE TABLE `web_directory` (
`id` int(8) unsigned NOT NULL auto_increment,
`parent` int(8) unsigned NOT NULL default '0',
`name` varchar(100) NOT NULL default '',
PRIMARY KEY (`id`),
KEY `name` (`name`),
KEY `parent` (`parent`)
);

...

SELECT id
FROM web_directory
WHERE name = 'Web Chess Viewers'
AND parent = (

SELECT id
FROM web_directory
WHERE name = 'Software'
AND parent = (

SELECT id
FROM web_directory
WHERE name = 'Chess'
AND parent = (

SELECT id
FROM web_directory
WHERE name = 'Abstract'
AND parent = (

SELECT id
FROM web_directory
WHERE name = 'Board Games'
AND parent = (

SELECT id
FROM web_directory
WHERE parent =0
AND name = 'Games'
LIMIT 1
)
LIMIT 1
)
LIMIT 1
)
LIMIT 1
)
LIMIT 1
)
LIMIT 1

What are the speed advantages of a web directory built from LDAP be
faster than SQL.

Vadim Tropashko

unread,

Nov 16, 2005, 5:15:18 PM11/16/05

to

This is adjacency relation. Your (dynamically constructed) query is
efficient.

> What are the speed advantages of a web directory built from LDAP be
> faster than SQL.

Maybe. SQL advantage is flexibility:
"Report all the users with the count of the number of folders they
created in the last 24 hours"
Try writing this in LDAP.

asdf

unread,

Nov 16, 2005, 5:48:59 PM11/16/05

to

To get a path from an id, how much more efficient is the nested
intervals method compared to this adjacency list model?

example of selecting the path from the adjacency list:

SELECT parent, name INTO p1, n1 FROM web_directory WHERE id = 1234
SELECT parent, name INTO p2, n2 FROM web_directory WHERE id = p1
SELECT parent, name INTO p3, n3 FROM web_directory WHERE id = p2
SELECT parent, name INTO p4, n4 FROM web_directory WHERE id = p3

Mikito Harakiri

unread,

Nov 16, 2005, 5:59:13 PM11/16/05

to

About the same on server side. Your method has small overhead incurring
while doing multiple SQL calls.

asdf

unread,

Nov 16, 2005, 6:48:41 PM11/16/05

to

Thank you very much!

You said the examples are fast, is there any significant speed
difference to the nested intervals model?
What's the speed difference between the multiple selects adjacency list
example shown above and this:

SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as
lev4 t5.name as lev5 t6.name as lev6
FROM web_directory AS t1
LEFT JOIN web_directory AS t2 ON t2.parent = t1.id
LEFT JOIN web_directory AS t3 ON t3.parent = t2.id
LEFT JOIN web_directory AS t4 ON t4.parent = t3.id
LEFT JOIN web_directory AS t5 ON t5.parent = t4.id
LEFT JOIN web_directory AS t6 ON t6.parent = t5.id
WHERE t1.name = 'Games'
AND t2.name = 'Board Games'
AND t3.name = 'Abstract';
AND t4.name = 'Chess';
AND t5.name = 'Software';
AND t6.name = 'Web Chess Viewers';

What speed advantages do the nested intervals model have that that
adjacency list doesn't?

Frank Hamersley

unread,

Nov 16, 2005, 6:54:03 PM11/16/05

to

DonR wrote:
> dawn wrote:
>
>>mic...@preece.net wrote:
>>
>>>I can't understand why no one has suggested Pick for this. Seems
>>>perfectly simple to me. Anyone who has used Pick would not think twice
>>>about this. The solution to the OP's problem would be trivial in the
>>>extreme.
>>
>>Of course you are right, Mike. smiles. --dawn
>
> Micheal and Dawn,
> I see the folks here just ignored the Pick suggestion. This has always
> been Pick's problem, the guru's think things have to be complicated to work.
> Of course, if they were using Pick, they wouldn't need to be on here
> discussing meaningless theory.

To me "meaningless" is code for don't (or won't) understand!

> I've worked with Pick for 20 years and haven't found a business problem
> that Pick couldn't handle.

I have never used (although I have worked alongside on occasions) Pick
in 25 years and haven't found a business problem that can't be solved
either.

Cheers, Frank. (Ozi Ozi Ozi Oi Oi Oi)

Mikito Harakiri

unread,

Nov 16, 2005, 7:49:48 PM11/16/05

to

Hmm, i thought that Pick, XQuery and others are still trying to figure
out how to solve this
"Find all the employees who earn more than their managers"

Mikito Harakiri

unread,

Nov 16, 2005, 8:12:23 PM11/16/05

to

asdf wrote:
> Thank you very much!
>
> You said the examples are fast, is there any significant speed
> difference to the nested intervals model?
> What's the speed difference between the multiple selects adjacency list
> example shown above and this:
>
> SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as
> lev4 t5.name as lev5 t6.name as lev6
> FROM web_directory AS t1
> LEFT JOIN web_directory AS t2 ON t2.parent = t1.id
> LEFT JOIN web_directory AS t3 ON t3.parent = t2.id
> LEFT JOIN web_directory AS t4 ON t4.parent = t3.id
> LEFT JOIN web_directory AS t5 ON t5.parent = t4.id
> LEFT JOIN web_directory AS t6 ON t6.parent = t5.id
> WHERE t1.name = 'Games'
> AND t2.name = 'Board Games'
> AND t3.name = 'Abstract';
> AND t4.name = 'Chess';
> AND t5.name = 'Software';
> AND t6.name = 'Web Chess Viewers';

Why it's outer join? I assume you've meant normal join.

This query should be quick as well, assuming that optimizer finds the
correct plan. I can imagine many nodes with the names 'Abstract' and
'Software' in different places of the hierarchy. Being realistic,
however, the optimizer would never start from those tables, which are
happen to be inner nodes in the join graph. The most likely join order
alternatives are t1->t2->t3->t4->t5->t6, which is as efficient as your
earlier version, and
t6->t5->t4->t3->t2->t1 which is also efficient. Well, you'd better have
compound index on (parent, name).

> What speed advantages do the nested intervals model have that that
> adjacency list doesn't?

Given node id find the node level.

dawn

unread,

Nov 16, 2005, 10:26:07 PM11/16/05

to

No problem for PICK. Because virtual fields can be written with query
language constructs and/or DataBASIC (likely the best implementation
based on Dartmouth BASIC ever written), there is no problem answering
such questions. The ease of use with the query language is part of the
charm of PICK, while it is not at all part of the charm of XQuery (if
you ask me). This is accomplished by setting up the vocabulary for the
user, however, and behind some of that vocabulary might just be some
procedural code.

In this old query language that has a ton of names, none of which is
generic (see my poster of the MultiValue Family tree at
http://www.tincat-group.com/mv/familytree.html to get the list of names
down the right hand side in case you want to know if you have ever
encountered it), the queries are against a single
view/vocabulary/dictionary/file.

A PICK-like query for your example would look like:

SELECT EMPLOYEES WITH ANNUAL_SALARY > MANAGER_SALARY

[Notice that entity names are often plural so the queries sound quite
English-like, with ENGLISH being one of the names for this language]

There are never joins or compound queries. There is no such thing as a
base table (or else everything could be considered such). EMPLOYEES as
a strong entity is modelled as a file, and annual salary might be a
stored data attribute or field in that file. Manager salary would be
derived data or a "virtual field" associated with this same file, but
tracing the graph from this node to the manager node and then picking
off the stored salary data there. This example would obviously not
require any procedural code since it is simply node hopping, but a
virtual field like MANAGER_YTD_PAY in the EMPLOYEES
vocabulary/dictionary list would likely have a subroutine backing it to
aggregate from detail records.

I hope that made sense. XQuery as a language for a human being to
employ to query data doesn't hold a candle to the PICK query language.
But given a raw schema for stored data only and the PICK query
language, you get nowhere since you can only see through the lens of
one file/dictionary with any query. You can write some virtual fields
on the fly within the query language, but not enough to cover.

Cheers! --dawn

DonR

unread,

Nov 17, 2005, 11:15:58 AM11/17/05

to

Frank,

> To me "meaningless" is code for don't (or won't) understand!

You're right, I don't understand some of these theories but the point
is, I don't need to.

My employer and our customers would much rather pay me for results than
theories.

This entire discussion points out the adsurd complexities and
limitations of SQL and relational systems.

Of course, Pick's biggest advantage, ease of use, is also one of it's
downfalls. Most of the companies running Pick have one or two
programmers which may have been an operator or an accountant prior to
becoming the IT manager/programmer. This has resulted in a lot of
poorly written programs and non-professionalism. Properly written Pick
software is fast, flexible, stable and easy and cheap to maintain. What
more could you want? That's why it's endured for 30 years.

Cheers,
Don

Mikito Harakiri

unread,

Nov 17, 2005, 1:06:11 PM11/17/05

to

DonR wrote:
> Properly written Pick
> software is fast, flexible, stable and easy and cheap to maintain. What
> more could you want? That's why it's endured for 30 years.

I have a differnt explanation. Software industry today is very
forgiving. You can design/implent all kind of b*it you want, and still
have a paycheck. The list is pretty long: Corba, Application
Integration, XML, OR mappers, bloated up middleware, etc. Somehow even
the crappiest products manage to survive in a certain niches. Can you
prove Pick is not one of them?

asdf

unread,

Nov 17, 2005, 3:17:41 PM11/17/05

to

> > What speed advantages do the nested intervals model have that that
> > adjacency list doesn't?
>
> Given node id find the node level.

Thank you!
Finding the node level is useless for a web directory. What are the
other advantages?

Mikito Harakiri

unread,

Nov 17, 2005, 4:18:44 PM11/17/05

to

asdf wrote:
> What are the other advantages?

Write a query in the adjacency model that calculates the total number
of hits in a subhierarchy.

asdf

unread,

Nov 17, 2005, 4:48:41 PM11/17/05

to

> > What are the other advantages?
>
> Write a query in the adjacency model that calculates the total number
> of hits in a subhierarchy.

I could do that in the adjacency model recursively by storing the
number of website listings in a column.

Mikito Harakiri

unread,

Nov 17, 2005, 5:19:32 PM11/17/05

to

I don't understand did you mean "precompute hierarchical total", or
"calculate it recursively". You can precompute only a limited number of
aggregated totals. Some others you would have to query. Example:

How many subcategories called "TOOLS" are descendants in category
"/HOME"?

eg.
/HOME/GARDEN/TOOLS
/HOME/SOFTWARE/PC/TOOLS

You can find them recursively, but (unlike set of queries that find all
the ancestors) it is going to be slow. (The chain of ancestors is
normally small compared to the size of the tree, the subtree of
descendants is not).

Frank Hamersley

unread,

Nov 18, 2005, 7:26:37 AM11/18/05

to

DonR wrote:
> Frank Hamersley wrote:
>>DonR wrote:
>>>dawn wrote:
>>>>mic...@preece.net wrote:
>>>>>I can't understand why no one has suggested Pick for this. Seems
>>>>>perfectly simple to me. Anyone who has used Pick would not think twice
>>>>>about this. The solution to the OP's problem would be trivial in the
>>>>>extreme.
>>>>
>>>>Of course you are right, Mike. smiles. --dawn
>>>
>>>Micheal and Dawn,
>>>I see the folks here just ignored the Pick suggestion. This has always
>>>been Pick's problem, the guru's think things have to be complicated to work.
>>>Of course, if they were using Pick, they wouldn't need to be on here
>>>discussing meaningless theory.
>>
>>To me "meaningless" is code for don't (or won't) understand!
>>
>>>I've worked with Pick for 20 years and haven't found a business problem
>>>that Pick couldn't handle.
>>
>>I have never used (although I have worked alongside on occasions) Pick
>>in 25 years and haven't found a business problem that can't be solved
>>either.
>>
>>Cheers, Frank. (Ozi Ozi Ozi Oi Oi Oi)
>

>>To me "meaningless" is code for don't (or won't) understand!
>
> You're right, I don't understand some of these theories but the point
> is, I don't need to.

A bold stance - perhaps foolhardy - but either way a personal choice.

> My employer and our customers would much rather pay me for results than
> theories.

Personally, I have never found theories (or to give more credit,
precepts widely held to be useful) a significant impediment to getting
results. However I personally eschew the instant gratification model
that is prevalent these days, as my experience has shown its not always
as simple as the first flush of "results" oft convinces less cynical types.

> This entire discussion points out the adsurd complexities and
> limitations of SQL and relational systems.

Absurd? Complexities? IMO its all quite simple and KISS has never been
absurd in my world. Further to you point - why is it axiomatic that
limitations are a Bad Thing (tm)?

> Of course, Pick's biggest advantage, ease of use, is also one of it's
> downfalls. Most of the companies running Pick have one or two
> programmers which may have been an operator or an accountant prior to
> becoming the IT manager/programmer. This has resulted in a lot of
> poorly written programs and non-professionalism.

This is the root problem of the highly flexible tool in the hands of the
unskilled. A shifting spanner is the ideal proof of this problem! The
unskilled users actually start to believe in their own work because the
environment goes to lengths to try and save them from themselves. Much
better that they get found out sooner before resources and time are
needlessly consumed.

> Properly written Pick
> software is fast, flexible, stable and easy and cheap to maintain. What
> more could you want? That's why it's endured for 30 years.

Properly written anything is all those things. Neither Pick nor SQL has
a monopoly on this.

Cheers Frank (Germany, here we come!)

Frank Hamersley

unread,

Nov 18, 2005, 7:35:25 AM11/18/05

to

All true (don't get me started on my pet topic "instant gratification"
aka fairy floss) - even the disastrous ones survive like zombies in the
few projects that used them and got to first base for other reasons.

Personally I don't think Pick has those hallmarks as it still has some
adherents. However in 30 years of opportunity it hasn't totally
dominated the world so I feel quite relaxed about consigning it to the
also ran category as I doubt very much if it would achieve that in
another 30 years.

Cheers Frank.

asdf

unread,

Nov 18, 2005, 4:44:05 PM11/18/05

to

Thank you very much.

I have the category in one table. The website listings in another
table. Each website listing contains a cat_id referencing the 'id'
column in the 'web_directory' table. Searches are done by matching the
listings title, description, etc. What is the most efficient way to
restrict the results to a specific category?

How efficient is the nested intervals model compared to the adjacency
model for moving categories?

Thank you very much.

asdf

unread,

Nov 18, 2005, 5:28:22 PM11/18/05

to

How do I convert the adjacency list model to the nested intervals model?

asdf

unread,

Nov 18, 2005, 7:08:33 PM11/18/05

to

> > Because when every
> > time I insert a new listing (web page listing, not a node in a tree) in
> > a category, all of the ancestors listings_count row have to be updated.
> > That would be inefficient because the higher the category, the more
> > often listings_count row have to be updated, and it would slow down the
> > directory sufficiently because the hard disk have to rotate to the
> > category_count row frequently so it would slow down other stuff (like
> > selecting categories) considerably. How does dmoz.org do it?
>
> If you ever find out, please post it here:-)

I found out at http://www.skrenta.com/ that dmoz actually uses a
flat-file VXFS filesystem

Vadim Tropashko

unread,

Nov 18, 2005, 7:15:08 PM11/18/05

to

asdf wrote:
> How do I convert the adjacency list model to the nested intervals model?

The surprising part is that you don't.

In adjacency relation you have 2 columns

table TreeNode (
id integer,
parent_id integer,
... more stuff ...
)

In the nested intervals you have 4 numbers instead of the two

table TreeNode (
a11 integer,
a12 integer, -- this is the same as parent_a11
a21 integer,
a22 integer -- this is the same as parent_a21
)

If you combine a11 with a21 into one number, on one hand, and a12 with
a22 into other you would even have the structure indistinguisheable
from adjacency relation. This was just to emphasize that matrix
encoding is adjacency relation as well.

Now, unlike the adjacency relation which gives you a lot of fredom what
id numbers to select, with matrix encoding you have to follow some
strict rules. For example, given a parent node

parent_a11 = 4
parent_a12 = 1
parent_a21 = 1
parent_a22 = 0

and you want to attach a child to it, you can't just guess some ad-hock
pair of numbers, say 34 and 67 and claim that the node

a11 = 34
a12 = 67
a21 = 4
a22 = 1

Formally, yes, this node refers to the parent node

a11 = 4
a12 = 1
a21 = 1
a22 = 0

but, unfortunately it's invalid encoding.

You attach a new child #N to the parent_a11, parent_a12, parent_a21,
parent_a22 by the formula

insert into MatrixTreeNodes (a11,a12,a21,a22) values
(:parent_a11*(:N+1) - :parent_a12,
:parent_a11,
:parent_a21*(:N+1) - :parent_a22,
:parent_a21);

This statement is on page 23 of the manuscript that you have. (BTW, how
readable do you find the manuscript? Throw in some comments please).

If you have a tree in some other encoding (being that unrestricted
adjacency relation, materialised path, nested sets, etc) you would have
to reconstruct the tree in matrix encoding node-by-node. This is true
no matter between what encodings are you migrating. (Although in some
cases life is much easier: for example, you can _calculate_
materialized path from nested intervals and otherwise, but this applies
to MP of concatenated siblings numbers, not concatenated node ids or
node names!).

Mikito Harakiri

unread,

Nov 18, 2005, 7:25:42 PM11/18/05

to

asdf wrote:
> Thank you very much.
>
> I have the category in one table. The website listings in another
> table. Each website listing contains a cat_id referencing the 'id'
> column in the 'web_directory' table. Searches are done by matching the
> listings title, description, etc. What is the most efficient way to
> restrict the results to a specific category?

I prefer formal description of the problem. In Celko words "please post
DDL"

> How efficient is the nested intervals model compared to the adjacency
> model for moving categories?

Adjacency relation is more efficient for moving categories. In nested
intervals you have to recalculate encodings of all the nodes in a tree
branch. The formula for encoding recalculation is quite simple, but
still the overall efficiency can't match adjacency model where you
change the encoding of just one node.

asdf

unread,

Nov 19, 2005, 1:17:49 PM11/19/05

to

CREATE TABLE `directory` (
`id` int(6) unsigned NOT NULL auto_increment,
`name` varchar(100) NOT NULL,
`a11` int(8) unsigned NOT NULL,
`a12` int(8) unsigned NOT NULL,
`a21` int(8) unsigned NOT NULL,
`a22` int(8) unsigned NOT NULL,

PRIMARY KEY (`id`),

KEY `name` (`name`,`a11`,`a12`,`a21`,`a22`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

CREATE TABLE `listings` (
`id` int(8) NOT NULL auto_increment,
`cat_id` int(6) NOT NULL,
`url` varchar(150) NOT NULL,
`title` varchar(100) NOT NULL,
`description` text NOT NULL,

PRIMARY KEY (`id`),

KEY `cat_id` (`cat_id`),
FULLTEXT KEY `description` (`description`),
FULLTEXT KEY `title` (`title`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

Suppose I want to search the listings. The search results are ordered
by relevency. If it matches the listing.title or the
listing.description, it is a result.

Now suppose I want to restrict the search results to a subcategory. How
can I do it efficently? That is, without reading much disk space,
without wasting much memory, and without checking if each result if
it's in the subcategory.

The Yahoo! Directory, Google Directory, and the Open Directory Project
uses its search engine instead of searching the directory database.
That's why their directory searches are fast.

It seems there isn't any use of nested intervals. So I should continue
using the adjacency model and a search engine instead of the nested
intervals model. In that way the searches are more accurate because the
search engine can index the web pages' text instead of only matching
the title and the description. Is there any useful reason for using the
nested intervals model for a web directory?

Mikito Harakiri

unread,

Nov 19, 2005, 11:16:55 PM11/19/05

to

asdf wrote:
> CREATE TABLE `directory` (
> `id` int(6) unsigned NOT NULL auto_increment,
> `name` varchar(100) NOT NULL,
> `a11` int(8) unsigned NOT NULL,
> `a12` int(8) unsigned NOT NULL,
> `a21` int(8) unsigned NOT NULL,
> `a22` int(8) unsigned NOT NULL,
> PRIMARY KEY (`id`),
> KEY `name` (`name`,`a11`,`a12`,`a21`,`a22`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

(a11, a21) is unique key. Therefore you don't need id.

> CREATE TABLE `listings` (
> `id` int(8) NOT NULL auto_increment,
> `cat_id` int(6) NOT NULL,
> `url` varchar(150) NOT NULL,
> `title` varchar(100) NOT NULL,
> `description` text NOT NULL,
> PRIMARY KEY (`id`),
> KEY `cat_id` (`cat_id`),
> FULLTEXT KEY `description` (`description`),
> FULLTEXT KEY `title` (`title`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

Ok, the directories are branch nodes of the tree and listings are the
leaves, right?

> Suppose I want to search the listings. The search results are ordered
> by relevency. If it matches the listing.title or the
> listing.description, it is a result.

Query example? In particular, I'm interested to know if you scan the
whole listing, or do the index range scan, or use an inverted index.
Then I can tell if this seach can be compatible with nested intervals,
as you ask in the next paragraph.

> Now suppose I want to restrict the search results to a subcategory. How
> can I do it efficently? That is, without reading much disk space,
> without wasting much memory, and without checking if each result if
> it's in the subcategory.

Nested Intervals searching within a subcategory is index range scan.
Depending how you search the listings, it may be compatible with your
search. Anyway, I don't see why you have two tables. If you combine
them together

CREATE TABLE `directoryListings` (

`name` varchar(100) NOT NULL,
`a11` int(8) unsigned NOT NULL,
`a12` int(8) unsigned NOT NULL,
`a21` int(8) unsigned NOT NULL,
`a22` int(8) unsigned NOT NULL,

`url` varchar(150) NOT NULL,
`title` varchar(100) NOT NULL,
`description` text NOT NULL,
PRIMARY KEY (`id`),
KEY `cat_id` (`cat_id`),
FULLTEXT KEY `description` (`description`),
FULLTEXT KEY `title` (`title`)

) ;

then you might introduce composite index. Although 'FULLTEXT' indicates
that you are most likely use inverted index. Let me figure out if
inverted index can be combined with normal index...

asdf

unread,

Nov 20, 2005, 12:12:44 PM11/20/05

to

I hate the nested intervals model because the adjacency list model can
just be as efficient as that model. Nested intervals is useless for a
web directory.

asdf

unread,

Dec 2, 2005, 11:49:35 PM12/2/05

to

Thank you very much, but it's bad to store the category name on every
listing on each category. You said that you can count the listings very
efficiently using the adjacency list model?

I don't want this:

CREATE TABLE `directorylistings` (
`lid` int(8) unsigned NOT NULL,
`cid` int(8) unsigned NOT NULL,

`name` varchar(100) NOT NULL,
`a11` int(8) unsigned NOT NULL,
`a12` int(8) unsigned NOT NULL,
`a21` int(8) unsigned NOT NULL,
`a22` int(8) unsigned NOT NULL,
`url` varchar(150) NOT NULL,
`title` varchar(100) NOT NULL,
`description` text NOT NULL,

PRIMARY KEY (`lid`),
KEY `right_col` (`a12`,`a22`),
KEY `atom` (`a11`,`a12`,`a21`,`a22`),
KEY `left_col_name` (`name`,`a12`,`a22`),
KEY `cid_name` (`cid`,`name`),
KEY `url` (`url`),

FULLTEXT KEY `description` (`description`),
FULLTEXT KEY `title` (`title`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1;

because it's a de-normalized representation of the data and wastes
space.

I would do something like this:

CREATE TABLE `directory` (
`id` int(6) unsigned NOT NULL auto_increment,
`name` varchar(100) NOT NULL,
`a11` int(8) unsigned NOT NULL,
`a12` int(8) unsigned NOT NULL,
`a21` int(8) unsigned NOT NULL,
`a22` int(8) unsigned NOT NULL,
PRIMARY KEY (`id`),

UNIQUE KEY `name` (`name`,`a11`,`a12`),
KEY `parent` (`a21`,`a22`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

CREATE TABLE `listings` (
`listing_id` int(8) unsigned NOT NULL auto_increment,
`cat_id` int(6) unsigned NOT NULL default '0',

`url` varchar(150) NOT NULL,
`title` varchar(100) NOT NULL,
`description` text NOT NULL,

`rating` tinyint(2) unsigned NOT NULL default '0',
`lft_index` double unsigned NOT NULL default '0',
`rgt_index` double unsigned NOT NULL default '0',
PRIMARY KEY (`listing_id`),

KEY `cat_id` (`cat_id`),

KEY `lft_index` (`lft_index`),
KEY `rgt_index` (`rgt_index`),
KEY `rating` (`rating`),

FULLTEXT KEY `description` (`description`),
FULLTEXT KEY `title` (`title`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

so you can select the 10 top rated listings in a specific category
efficiently by doing this:

SELECT listings.*
FROM listings
WHERE directory.a11*node.a21 >= directory.a21*node.a11
AND directory.a11*node.a22 >= directory.a21*node.a12
AND listings.lft_index
BETWEEN node.a11/node.a21 - 0.00000001 AND
(node.a11-node.a12)/(node.a21-node.a22) + 0.0000001
ORDER BY listings.rating DESC LIMIT 0, 10

so you can select the websites within a specific category that contains
"Software" by doing this:

SELECT listings.*
FROM listings
WHERE directory.a11*node.a21 >= directory.a21*node.a11
AND directory.a11*node.a22 >= directory.a21*node.a12
AND listings.lft_index
BETWEEN node.a11/node.a21 - 0.00000001 AND
(node.a11-node.a12)/(node.a21-node.a22) + 0.0000001
AND listings.title LIKE '%Software%'

Is there a better way to do this?

You said the adjacency relation can search the listings within a
specific category efficiently. How can I do that with the adjacency
relation?

Thank you very much.

asdf

unread,

Dec 3, 2005, 1:12:58 PM12/3/05

to

I actually wrote a stored procedure that converts the adjacency
relation model to the nested intervals model.

Mikito Harakiri

unread,

Dec 3, 2005, 2:58:07 PM12/3/05

to

asdf wrote:
> I don't want this:
>
> CREATE TABLE `directorylistings` (
> `lid` int(8) unsigned NOT NULL,
> `cid` int(8) unsigned NOT NULL,
> `name` varchar(100) NOT NULL,
> `a11` int(8) unsigned NOT NULL,
> `a12` int(8) unsigned NOT NULL,
> `a21` int(8) unsigned NOT NULL,
> `a22` int(8) unsigned NOT NULL,
> `url` varchar(150) NOT NULL,
> `title` varchar(100) NOT NULL,
> `description` text NOT NULL,
> PRIMARY KEY (`lid`),
> KEY `right_col` (`a12`,`a22`),
> KEY `atom` (`a11`,`a12`,`a21`,`a22`),
> KEY `left_col_name` (`name`,`a12`,`a22`),
> KEY `cid_name` (`cid`,`name`),
> KEY `url` (`url`),
> FULLTEXT KEY `description` (`description`),
> FULLTEXT KEY `title` (`title`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
>
> because it's a de-normalized representation of the data and wastes
> space.

Tree leaves and tree nodes in your model serve the different purpose,
so that you indeed can insist storing them in a different tables. Space
argument, though, is not especially convincing. The most of the space
would be taken by the text columns: description, title, and url, and
those are represented by variable length strings. The main
justification in favor of a single table is conceptual simplicity of
having tree nodes and leaves together.

> I would do something like this:
>
> CREATE TABLE `directory` (
> `id` int(6) unsigned NOT NULL auto_increment,
> `name` varchar(100) NOT NULL,
> `a11` int(8) unsigned NOT NULL,
> `a12` int(8) unsigned NOT NULL,
> `a21` int(8) unsigned NOT NULL,
> `a22` int(8) unsigned NOT NULL,
> PRIMARY KEY (`id`),
> UNIQUE KEY `name` (`name`,`a11`,`a12`),
> KEY `parent` (`a21`,`a22`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

Be warned about foreign key constraint: you can't actually declare it.
There is a subtle but important detail: root node

[1 -1]
[1 0]

refers to a parent

[1 ?]
[0 ?]

which doesn't exist. In the adjacency relation the root refers to null
parent, which allows to declare a formal foreign key integrity
constraint.

This is an interesting question that I don't know the answer yet. Your
first query as it is written would find all the children by the index
range scan and then reorder and filter top 10. Efficiency depends on
the size of the subtree. Even if we simplify your query to find a
single top rated page, it is still going to be executed this way. I can
imagine (but fail to figure out) a more sophisticated index on
combination of interval columns and ranking column that would find the
top ranking page within the interval fast.

asdf

unread,

Dec 4, 2005, 1:12:42 PM12/4/05

to

The rating is between 0 and 100.

Is this more efficient to select the top ten listings:

SELECT listings.*
FROM listings
WHERE listings.lft_index

BETWEEN node.a11/node.a21 - 0.00000001 AND

(node.a11-node.a12)/(node.a21-node.a22) + 0.00000001
AND listings.rating >= 50

ORDER BY listings.rating DESC LIMIT 0, 10

or this:

DECLARE cnt INT;
DECLARE number INT;

SET number = 75;

REPEAT

SET number = number - 3;

SELECT COUNT(*) INTO cnt
FROM listings
WHERE listings.lft_index

BETWEEN node.a11/node.a21 - 0.00000001 AND

(node.a11-node.a12)/(node.a21-node.a22) + 0.00000001
AND listings.rating >= number

UNTIL cnt >= 10
END REPEAT;

SELECT listings.*
FROM listings
WHERE listings.lft_index

BETWEEN node.a11/node.a21 - 0.00000001 AND

(node.a11-node.a12)/(node.a21-node.a22) + 0.00000001
AND listings.rating >= number

Mikito Harakiri

unread,

Dec 4, 2005, 10:08:33 PM12/4/05

to

The second solution looks more inefficient than the first one. You
simply do the same work many times.

If I understand your idea correctly, you can create composite index on
(rating,lft). But then the only way to leverage this index is when you
have equality predicate on rating. So you can do iterations like you
do: start with rating 100, and decrement rating by one in a loop.

Suppose, there are 10 sites rated 100 under some category, that we
query. Then they would be found immediately on the first iteration.
What if there is no high rated sites below some category? Then you'll
have to do many iterations, and that might be slower that the other
method.