I am building a web directory similar to the dmoz and the yahoo
directory. The categories get updated often.
How do I find the subcategories from just the names of the ancester and
the current categories? For example, both the dmoz and yahoo
directories have filesystem-like URLs instead of category IDs. How do I
do that with nested sets or nested intervals? If not, how do they do
it?
There should be a count for how many listings in each subcategory. It
is much more efficient to count how many listings are in each
subcategory with nested sets than the adjacency list. Is there a way to
count subcategories efficently with nested intervals with farey
fractions?
Thank you very much!
Subcategories can be counted "semi-efficiently", that is you have to
find all the intervals (or just all the left boundary points) that are
contained within the ancestor interval. The speed is proportional to
the size of the subtree.
Nested sets is often quoted as a solution for static hierarchies. It is
rarely mentioned that one can't query ancestors path efficiently.
Admittedly, nested sets counts the descendants nicely as rgt-lft-1, but
"count" is the only hierarchical total aggregate function that nested
sets can do efficiently.
The implementation of ancestor query in nested intervals encoding is
somewhat messy. If we ignore the efficiency, then
select count(1)
from tree t
where compare(rational(a,b), lft) > 0 and
compare(rgt(lft),rational(c,d)) >= 0
should be good enough. The [a/b, c/d] is the encoding of your node,
below which you count the descendants. This is full table scan of the
tree table. This could be improved by adding explicit predicate to turn
the access path to index range scan
select count(1)
from tree t
where compare(rational(a,b), lft) > 0 and
compare(rgt(lft),rational(c,d)) >= 0
and t.lft.num/t.lft.den between
a/b-0.00000001 and c/d+0.00000001
where the magic constant 0.00000001 is designed to offset floating
point numerical imprecision.
In the nested sets model, (rgt -lft +1 )/2 = size of subtree at this
root node.
/* Namely the story of graph-based vs. relational systems. "Relational
atabases are universally conceived of as an advance over their
predecessors network and hierarchical models." (Vadim on dbazine.com)
This is probably true in the historical context of Ingres vs. Codasyl
etc. but mostly a myth. Unfortunately Codasyl's demise is (incorrectly)
perceived as a weakness of *all* graph-based data systems. (But why are
file systems still hierarchical today?) */
Why indeed! It doesn't actually make a lot of sense. It is great for some
needs but by no means all. A more versatile organization which could be
viewed as a hierarchy (when that is useful) seems long past due.
Roy
Is it another case of a "read a book or something" ? Codd's "A
Relational Model of Data for Large Shared Databanks" may be a good
start in order to understand why "graph-based systems" are not a good
approach to managing data.
No.
<<"graph-based systems" are not a good approach to managing data>>
Repeating a myth does not make it true. I'll gladly discuss any
technical issue.
Calling Codd's work a myth is cute, but can you offer any technical
argument in favour of "graph-based systems" in addtion to what has been
rehashed on this newsgroup many times and besides "I say so" ?
I don't recall the arguments in Codd's original article of 1970 or so.
Maybe I'll take a relook next time I stop by the University library.
Anyway the following holds.
Some semantic domains are better modelled with graphs and trees than
with relations. The idioms for trees and graphs in SQL is extremely
convoluted to say the least (the word "pathetic" comes to mind). The
original problem in this thread is an concrete example.
Relational star people themselves recognized shortcomings of the
relational theory for semantic modelling. This was a major factor in
the creation of the Entity-Relationship model. And guess what: the E-R
is a graph-based model. (Not to mention UML class diagrams, but I know
what you'll say: these are for programs, not data.)
There's graph-based and graph-based. As you I also don't subscribe to
the socalled "hierarchical" and "network" models as they are often
presented in relational circles. They seem to be a heritage of Codasyl
and alike systems of the 1960's. Fortunately others kept researching
and have come up with much better graph-based models e.g. Resource
Description Framework and even XML.
This is because the domain is poorly understood. I would challenge the
idea that XML people have clearer understang of graph problems. Tags
don't magically solve problems.
> Relational star people themselves recognized shortcomings of the
> relational theory for semantic modelling. This was a major factor in
> the creation of the Entity-Relationship model. And guess what: the E-R
> is a graph-based model. (Not to mention UML class diagrams, but I know
> what you'll say: these are for programs, not data.)
Well, the join operation is another binary relation that can be
interpreted as edges in a graph. Excuse me, but that is not a
particularly deep idea. I fail to see it utilized anywhere except query
optimization.
> There's graph-based and graph-based. As you I also don't subscribe to
> the socalled "hierarchical" and "network" models as they are often
> presented in relational circles. They seem to be a heritage of Codasyl
> and alike systems of the 1960's. Fortunately others kept researching
> and have come up with much better graph-based models e.g. Resource
> Description Framework and even XML.
It is not a matter of "do hierarchical and network models make sence".
It's a matter of how nice is the underlying theory. So far relational
people seems to be unimpressed.
Before going any farther, let's get our definitions straight in order
to prevent mutual misunderstanding. Please define the "semantic
domain".
>The idioms for trees and graphs in SQL is extremely
> convoluted to say the least (the word "pathetic" comes to mind).
The tools for treating tree-like structures in RDBs are pretty
straightforward, there is nothing "convoluted" or "pathetic" about
them. In addition to what has already been suggested, every major
database has some recursive query implementation that lets the OP solve
his problem easily.
>The
> original problem in this thread is an concrete example.
See above.
>
> Relational star people themselves recognized shortcomings of the
> relational theory for semantic modelling.
Who are "relational star people" ? Also, please define "semantic
modelling" .
> This was a major factor in
> the creation of the Entity-Relationship model. And guess what: the E-R
> is a graph-based model.
So what ? Chen's E-R model is *not* the relational model . E-R models
can be mapped to relational models or non-relational models.
>(Not to mention UML class diagrams, but I know
> what you'll say: these are for programs, not data.)
Let's not mention UML ;)
>
> There's graph-based and graph-based. As you I also don't subscribe to
> the socalled "hierarchical" and "network" models as they are often
> presented in relational circles.
If you do not subscribe to, say, the network model, then you can
probably spell out how the/(a) graph-based model is different from the
network model.
>They seem to be a heritage of Codasyl
> and alike systems of the 1960's. Fortunately others kept researching
> and have come up with much better graph-based models e.g. Resource
> Description Framework and even XML.
How is RDF or XML different from the network model ?
Nested Intervals are very efficient for updating and selecting
hierarchical categories. I am creating a web application/directory from
it, but it would be more convenient for the user to retrieve its
children from an array of folder names in the URL (seperated by dots or
slashes, like google groups does for its usenet categories) instead of
a category_id.
Both dmoz and yahoo also retrieves its immediate subcategories by the
array of folder-names in the URL; Unlike some other directories that
retrives its immediate subcategories from a category_id in the URL. Is
there any efficient way to retrieve its immediate subcategories by an
array of folder-names such as dmoz and yahoo?
Basically, it's like replicating a file system: an array of folder
names can point to the category_id to retrieve their children. However,
it would take recursive SELECTs to find the category_id from the array
of folder names. I could store the whole path in each row, but it would
be very inefficient to update a category. Is there an efficient way to
do this?
Thank you very much!
So why don't you just use a simple adjacency list and a recursive query
instead ? What database are you using ?
It's unclear how one can achieve an exponential performance degradation
with an adjacency list traversal. Could you post the piece of code you
use to do that ?
There are quite a few tree traversal implementations for MS SQL Server
2000 on the Internet. You can google for "SQL Server" and "tree
traversal".
Relational *key* people (sorry, "star" was bad choice of word, unless I
cited Kleene too:-) you know and cited two, Codd, Chen, and then
there's Date, Ulman, Wiederhold, etc. and I would not mind citing here
the people siting at the W3C working groups for XQuery, XPath, etc.
E-R maps better to graph-based representations than to relational ones.
It is not a matter of "can" but of "how well."
No I won't mention UML nor say the dirty word ("semi-structured":-)
There's lots of (good) graph-based models. RDF is (very) different from
XML and both are (very) different from Codasyl and my baby called
Mneson is (very, not so much, less) different from Codasyl, XML, RDF.
(I suggest Mneson be discuss in the thread "Implementing a graph
algebra" because that's it)
The "network" model in particular, or Codasyl: nodes are tables with
records and attributes. Not a pure graph. Actually not very graph-like
at all. I think the technical term is "a dirty mess."
How is RDF or XML different from Codasyl in particular: a lot
different. An RDF node is a single concept, not a table like in
Codasyl. An RDF edge is a single property out of a possibly (depending
on the meta-schema) open set of properties, not the single relation
"child" of Codasyl. Etc. RDF is a much cleaner model. A true graph
model. This has good consequences at many levels (theory, practice,
adoption). For example you can find very nice algebras for RDF out
there.
I assume that you asked how to query immediate descendants (children!)
in the nested intervals model. I've emailed the chapter snippet on
nested interval encoding to your yahoo email address. Look up page 15
where it develops that nested intervals matrix encoding is an adjacency
model.
Mike.
An employee and all their Supervisors, no matter how deep the
organizational tree, with a column for sorting the path from root to a
node:
SELECT O2.emp, (O2.rgt- O2.lft) AS path_order
FROM OrgChart AS O1, OrgChart AS O2
WHERE O1.lft BETWEEN O2.lft AND O2.rgt
AND O1.emp = @myemployee
ORDER BY path_order DESC;
So what's the difference between a domain and a semantic domain ? If there
is none, why muddy waters with the fancy names like "semantic" ?
>
> Relational *key* people (sorry, "star" was bad choice of word, unless I
> cited Kleene too:-) you know and cited two, Codd, Chen, and then
> there's Date, Ulman, Wiederhold, etc. and I would not mind citing here
> the people siting at the W3C working groups for XQuery, XPath, etc.
>
> E-R maps better to graph-based representations than to relational ones.
> It is not a matter of "can" but of "how well."
>
> No I won't mention UML nor say the dirty word ("semi-structured":-)
>
> There's lots of (good) graph-based models. RDF is (very) different from
> XML and both are (very) different from Codasyl and my baby called
> Mneson is (very, not so much, less) different from Codasyl, XML, RDF.
> (I suggest Mneson be discuss in the thread "Implementing a graph
> algebra" because that's it)
Could you pleas give a definition of the graph algebra ?
>
> The "network" model in particular, or Codasyl: nodes are tables with
> records and attributes. Not a pure graph. Actually not very graph-like
> at all. I think the technical term is "a dirty mess."
>
> How is RDF or XML different from Codasyl in particular: a lot
> different. An RDF node is a single concept, not a table like in
> Codasyl. An RDF edge is a single property out of a possibly (depending
> on the meta-schema) open set of properties, not the single relation
> "child" of Codasyl. Etc. RDF is a much cleaner model. A true graph
> model. This has good consequences at many levels (theory, practice,
> adoption). For example you can find very nice algebras for RDF out
> there.
In order to be more specific, would you please elaborate on the RDF algebra.
like, give a definition of the algebra ?
>
We have already discussed this. The predicate
O1.lft BETWEEN O2.lft AND O2.rgt
looks simple, but is assymmetric regarding O1 and O2. In the acces path
driven from O2 to O1, the O1 set could be found by index range scan.
Find the first node in the b-tree satisfying
O1.lft BETWEEN O2.lft AND O2.rgt
then go to the next b-tree leaf, and so on. This is very efficient.
In contrast, the access path driven from O1 to O2 you have either to
scan the whole table O2, or use index range scan that would scan an
interval with one boundary condition only
O1.lft >= O2.lft
or
O1.lft <= O2.rgt
which would scan large portion of the index. This is (relatively)
inefficient.
Take a set of all tables as a set of vertices. Take a set of all
foreign key integrity constraints as a set of edges. Do you challenge
the idea that I just defined a graph?
Yes, but in such a formulation it is not RM! It is a graph model
implemented via RM.
Nobody argues that many things can be implemented via RM including
graph-based models. I could even imagine that someone could implement
OOP or AOP techniques using RM. Or, vice versa, if I implement a
behaviour of RM using Turing machine then it does not mean that Turing
has invented RM.
The question is (as far as I understand it):
Is it more appropriate to view the (data) world as a set of relations
(with some operations, constraints etc.) or as a graph?
An RDF algebra is given in "RAL: an Algebra for Querying RDF" by
Fransicar et al. There are others. I remember seing very nice
discussions in the proceedings of the WWW conference some years ago.
For XML see: "TAX: A Tree Algebra for XML" by Jagadish et al. and
"XQuery 1.0 and XPath 2.0 Formal Semantics" by the W3C, which
supersedes "The XML Query Algebra."
For my own efforts see thread "Implementing a graph algebra."
Please note that I do not particulary favour RDF. I am interested in
more general languages, ideally one capable of solving the wild
disparity of Semantic Web languages found today. Maybe a bit like what
Codd did in the 1970's w.r.t. the disparity of database models.
No, that won't do. Please give your own RDF algebra definition, not
necessarily formal. Surely it cannot be that complicated.
> For XML see: "TAX: A Tree Algebra for XML" by Jagadish et al. and
> "XQuery 1.0 and XPath 2.0 Formal Semantics" by the W3C, which
> supersedes "The XML Query Algebra."
>
> For my own efforts see thread "Implementing a graph algebra."
You did not provide any definition of the algebra. Could you please do
it now ?
>
> Please note that I do not particulary favour RDF. I am interested in
> more general languages, ideally one capable of solving the wild
> disparity of Semantic Web languages found today.
I do not understand what you mean.
> Maybe a bit like what
> Codd did in the 1970's w.r.t. the disparity of database models.
Also, you did not clarify what you've meant by a "semantic domain" as
opposed to just a "domain". Could you offer a clarification ?
I was merely suggesting that there is no more substance to the
Entity-Relationship model than in the above graph definition.
Mneson is a graph-based metamodel of data, based on a graph with the
following properties:
1. Connections are binary, directed, and untyped i.e. unlabelled. Note
that being untyped entails not having an explicit identifier. A
connection is identified solely by the (ordered) pair of vertices it
represents. Hence, there are no duplicate connections. That is, for any
pair of vertices (x, y) there can be at most one connection from x to
y. For a connection from x to y we say that x is a "source" of y and y
is a "target" of x.
2. A vertex is either a datum or a pivot.
(a) A data vertex or datum represents an atomic value of data e.g. an
integer, a real number, a symbol. The value of a data vertex is the
identity of the vertex.
(b) A pivot has no intrinsic value but it has nevertheless a unique
identity. In practice a serial number is used to identify pivot
vertices.
3. There are no island vertices. There may be, however, island
subgraphs.
A database instance is expressed in the base structure by means of
special subgraph topologies that represent complex data entities like
attributes, lists, sets, etc. [1] A possible definition is as follows:
Attribute. An attribute of name x1 and value x2 is represented by an
attribute instance x3 such that x1 connects to x3 and x3 connects to
x2. The attribute is connected to its subject x0 by connecting x0 to
x3. For example suppose that x5 represents an employee; the fact that
x5 has Social Security Number, or SSN (the attribute name) 123456789
(the attribute value) is expressed by connecting: x5 to x6, "SSN" to
x6, x6 to 123456789.
Set. For a vertex x that represents a set, the targets of x represent
the members of the set. In parlance we say x is the 'head' of the set.
Note this is a different object than the algebraic type set of vertices
below.
Etc.
The topologies may overlap, e.g. a vertex x that is an attribute type
may also be the member of a set, or even a set itself in which case the
respective attribute instance is a member of x.
An algebra (called the Mneson Calculus) is defined on the base graph as
follows. The algebraic type is the set of vertices. The operations
include the set operations (intersection, union, etc.) and the (unary)
adjacency operators T, S for Targets, Sources, with the obvious
semantics:
T (X) = {y : y is a target of x, x in X}
S (X) = {y : y is a source of x, x in X}
The Mneson Calculus serves as a query language. For example, the
following query retrieves the employee with SSN 123456789:
T ({Employees}) ^ S [T ({"SSN"}) ^ S({123456789})]
where ^ means set intersection.
Singleton arguments are very common so we take the liberty of ommiting
the {}, for clarity; and we want to focus on algebraic issues so often
we also ommit the "" around literals. With these simplifications the
expression above becomes
T (Employees) ^ S [T (SSN) ^ S (123456789)]
This example of course assumes a schema where vertex Employees is the
head of the set of employees.
________
[1] These entities are also called "semantic" structures because they
are designed to model semantic domains. Note how the semantic
structures are in turn modelled in terms of the base untyped graph. So
formally we have a chain of domains here and we say "semantic domain"
do describe the former level. Sometimes I say "semantic domain" even
when it is not necessary to distinguish, for habit I guess, I am sorry
if I sound pedantic, please just ignore the word on those occasions.
Do you mean to say that the entity is presented as a graph whose nodes
are values of some type and edges are directed paths with two paths (S
and T) per each node ? If so, how is it different or better that the
traditional network model ?
are values of some type and edges are directed paths with two paths (S
and T) per each node?"
No. A node is either a datum or a pivot. An edge is directed. S, T
means the sources, targets of a node, if any. A node may have any
number of M sources, any number N of targets, as long as M + N > 0.
For a connection from x to y we say x is a source of y, y is a target
of x.
S, T are also called A+, A- in the literature (A from "Adjacent").
The "traditional" network model being Codasyl, nodes are tables, and
edges have a predefined semantic value, member I think, I don't have
the reference at hand, anyway very different (and anyway I'm even less
interested in Codasyl than in RDF).
So how would you represent , say, three entities like 'employee',
'department', 'company' ?
Employees => x10
x10 => x11
WorksIn => x11
x11 => Accounting
where x => y means connected from x to y. Note that this model makes
similarly possible and simple to query for where does x10 work
T (T (x10) ^ T (WorksIn))
and for all workers in accounting
T (Employees) ^ S [S (Accounting) ^ T (WorksIn)]
(A more demanding schema would require a department be also a pivot, so
it could have attributes attached, and be a member of a Departments
like employee x10 is a member of Employees.)
Before going to queries, could you please work out, that is demontrate
what the graph is for the following two relations
Employees
Name Dept Salary
------ ------ -------
Smith 1 1000
Scott 2 2000
King 2 3000
Departments
Deptno Name
------ ------
1 Accounting
2 Manufacturing
digraph {
Employees -> x1
Employees -> x2
Employees -> x3
Name -> x4
Name -> x5
Name -> x6
Dept -> x7
Dept -> x8
Dept -> x9
Salary -> x10
Salary -> x11
Salary -> x12
Departments -> x13
Departments -> x14
Deptno -> x15
Deptno -> x16
Name -> x17
Name -> x18
x1 -> x4 -> Smith
x1 -> x7 -> x13
x1 -> x10 -> 1000
x2 -> x5 -> Scott
x2 -> x8 -> x14
x2 -> x11 -> 2000
x3 -> x6 -> King
x3 -> x9 -> x14
x3 -> x12 -> 3000
x13 -> x15 -> 1
x13 -> x17 -> Accounting
x14 -> x16 -> 2
x14 -> x18 -> Manufacturing
}
Note this is a canonical, literal translation of the given relational
database. Modelling the semantic domain directly on Mneson would
simplify things. For example the Deptno is probably not necessary
semantically but just a requirement of the relational model for a key,
or for the connection (join), or both. Mneson does not need keys. A
distinct entity is a distinct vertex. Connected entities simply share a
topology.
Ok, I see. Now, please explain the following:
1. How is the graph model navigation different from the old Codasyl network
model pointer chasing, as what you apparently do is:
a. go one step down from the Employee node and collect all its children
b. go one step up from the Accounting node and collect all its "parents"
c. go one step down from WorkIn and collect all its children
d. find the intersection of sets from step (b) and (c)
e. go one step up for each element in the set constructed at step (d)
f. find the intersection of sets constructed at step (a) and (e)
The only novelty in comparison to the old network model appears to be the
ability to use sets of nodes and set operations on them (which is not quite
a novelty since as far as I remember OMG's OQL could do the same).
2. The structure you've described is not an algebra because it's not closed
over the operations you've defined (compare to the relational algebra).
3. It's unclear how you handle attributes like age or salary (find all the
employees in accounts whose age is between 30 and 40 and salary is less than
40,000).