
Nearest Common Ancestor Report (XDb1's $1000 Challenge)


Neo

May 16, 2004, 2:27:07 PM
> Tony wrote: If your analogy holds any water at all (to give you the
> benefit of very large doubt), it suggests that relational theory will do
> just fine for pretty much anything we ever want to do "in the real world".

$1000 to the first person who replicates the equivalent of
www.xdb1.com/Example/Ex076.asp using the relational model. To claim
the prize, one needs to produce the equivalent Nearest Common Ancestor
Report from normalized and NULL-less data and the solution must be as
generic, meaning allow the user to create any hierarchy, consisting of
different types of things (each type to allow different attributes)
and each thing in the hierarchy to have any number of parents. Report
generation must not be more than 2X slower than XDb1 on equivalent
hardware.

Hugo Kornelis

May 16, 2004, 7:08:09 PM

Hi Neo,

I'm not sure if I understand all requirements. You demand that the
solution start from normalized data, but you fail to specify what normal
form you want: 1NF, 2NF, 3NF, etc.

Furthermore, you seem to desire the possibility to enter untyped data,
which is of course impossible in a strongly typed language. I do present
"sort of" a way to do this in a relational database, but I'd never use a
kludge even remotely like this for real. For exactly this same reason, I
consider XDb1 to be completely worthless for any real problem.
Remove types, and nothing prevents your user from entering "banana" as
John's age.

Third: you ask for the quoted example to be replicated "using the
relational model". A model is not something that can be used to produce
reports from data. I assume you meant to write "using an RDBMS"?

Many people in the comp.databases.theory group claim that MS SQL Server is
not really relational, but as I don't have access to a "real" relational
database, I hope it'll do.

One other thing that struck me in the input data:

>age isa thing.
>35 isa age.
>john is 35.
>
>weight isa thing.
>130 isa weight.
>mary is 130.

What if John were older (let's say 80) and Mary had lost a lot of weight
(let's say down to 80 pounds)? I fail to see how XDb1 could possibly be
able to interpret the following correctly:

age isa thing.
80 isa age.
john is 80.

weight isa thing.
80 isa weight.
mary is 80.

Since 80 is both an age and a weight, how can Xdb1 make sense of this?


Below, you'll find a script that runs against MS SQL Server to reproduce
your report. There's one difference in the output: the "things" fido and
laptop1 have two common ancestors at the same distance (john and mary),
but XDb1 is only able to find john. I consider this to be a bug in XDb1's
output, since john is definitely NOT a "nearer" common ancestor than mary.


-- suppress "(..) row(s) affected" messages
set nocount on
-- define some tables to hold the user input
create table things (thing varchar(20) not null,
                     primary key (thing))
create table types (thing varchar(20) not null references things,
                    type varchar(20) not null,
                    primary key (thing))
create table attributes_int (thing varchar(20) not null references things,
                             attribute varchar(10) not null,
                             value int not null,
                             primary key (thing, attribute))
create table attributes_char (thing varchar(20) not null references things,
                              attribute varchar(10) not null,
                              value varchar(100) not null,
                              primary key (thing, attribute))
create table hierarchies (thing varchar(20) not null references things,
                          otherthing varchar(20) not null references things,
                          hierarchy varchar(20) not null,
                          primary key (hierarchy, thing, otherthing),
                          check (thing <> otherthing))
-- this table is used to generate the report
create table ancestors (thing varchar(20) not null,
                        ancestor varchar(20) not null,
                        dist int not null,
                        primary key (thing, ancestor))
-- this table will hold the report
create table NCancestor (ThingX varchar(20) not null,
                         ThingY varchar(20) not null,
                         CmnAnc varchar(20) not null,
                         Dist int not null,
                         primary key (ThingX, ThingY, CmnAnc))
go
-- this procedure creates the report
create proc CommonAncestors
(@hierarchy varchar(20))
as
-- delete data from previous execution
delete ancestors
delete NCancestor
-- starting data: thing itself is dist 0, direct leader is dist 1
insert ancestors (thing, ancestor, dist)
select distinct thing, thing, 0
  from things
union all
select thing, otherthing, 1
  from hierarchies
 where hierarchy = @hierarchy
-- use iteration to determine indirect leaders
while (@@rowcount <> 0)
   insert ancestors (thing, ancestor, dist)
   select distinct a1.thing, a2.ancestor, a1.dist + a2.dist
     from ancestors AS a1
     join ancestors AS a2
          on a2.thing = a1.ancestor
    where not exists
          (select *
             from ancestors AS a3
            where a3.thing = a1.thing
              and a3.ancestor = a2.ancestor)
-- now find nearest common ancestor for each pair of things
insert NCancestor (ThingX, ThingY, CmnAnc, Dist)
select a1.thing, a2.thing, a1.ancestor, a1.dist + a2.dist
  from ancestors AS a1
  join ancestors AS a2
       on a1.ancestor = a2.ancestor
      and a1.thing < a2.thing
 where not exists
       (select *
          from ancestors AS a3
          join ancestors AS a4
               on a3.ancestor = a4.ancestor
              and a3.thing < a4.thing
         where a3.thing = a1.thing
           and a4.thing = a2.thing
           and a4.dist + a3.dist < a1.dist + a2.dist)
go
-- dummy execution on empty table forces compilation of stored proc
-- (I assume that Xdb1 uses a precompiled procedure as well)
exec CommonAncestors @hierarchy = 'whatever'
go
-- now add all the data
insert things (thing)
values ('god')
insert things (thing)
values ('army')
insert things (thing)
values ('trinity')
insert things (thing)
values ('john')
insert things (thing)
values ('mary')
insert things (thing)
values ('luke')
insert things (thing)
values ('fido')
insert things (thing)
values ('laptop1')
go
insert types (thing, type)
values ('army', 'force')
insert types (thing, type)
values ('trinity', 'church')
insert types (thing, type)
values ('john', 'person')
insert types (thing, type)
values ('mary', 'person')
insert types (thing, type)
values ('luke', 'person')
insert types (thing, type)
values ('fido', 'dog')
insert types (thing, type)
values ('laptop1', 'computer')
go
insert attributes_int (thing, attribute, value)
values ('john', 'age', 35)
insert attributes_int (thing, attribute, value)
values ('mary', 'weight', 130)
insert attributes_char (thing, attribute, value)
values ('luke', 'color', 'red')
go
insert hierarchies (thing, otherthing, hierarchy)
values ('army', 'god', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('trinity', 'god', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('john', 'army', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('mary', 'army', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('mary', 'trinity', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('luke', 'trinity', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('laptop1', 'john', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('laptop1', 'mary', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('fido', 'john', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('fido', 'mary', 'leader')
insert hierarchies (thing, otherthing, hierarchy)
values ('fido', 'luke', 'leader')
go
-- generate the report - show elapsed time
select current_timestamp AS 'Starting time'
exec CommonAncestors @hierarchy='leader'
select current_timestamp AS 'Ending time'
-- display the results
select *
from NCancestor
go
-- cleanup after execution
drop proc CommonAncestors
drop table NCancestor
drop table ancestors
drop table hierarchies
drop table attributes_int
drop table attributes_char
drop table types
drop table things


Output:

Starting time
------------------------------------------------------
2004-05-17 00:58:40.047

Ending time
------------------------------------------------------
2004-05-17 00:58:40.047

ThingX ThingY CmnAnc Dist
-------------------- -------------------- -------------------- -----------
army fido army 2
army god god 1
army john army 1
army laptop1 army 2
army luke god 3
army mary army 1
army trinity god 2
fido god god 3
fido john john 1
fido laptop1 john 2
fido laptop1 mary 2
fido luke luke 1
fido mary mary 1
fido trinity trinity 2
god john god 2
god laptop1 god 3
god luke god 2
god mary god 2
god trinity god 1
john laptop1 john 1
john luke god 4
john mary army 2
john trinity god 3
laptop1 luke trinity 3
laptop1 mary mary 1
laptop1 trinity trinity 2
luke mary trinity 2
luke trinity trinity 1
mary trinity trinity 1


Note: the output shows the same starting and ending date/time. This
indicates an elapsed time of less than 3 milliseconds (the smallest unit
of time CURRENT_TIMESTAMP can measure). System used: PC running Windows
2000 + MS SQL Server 2000, AMD Athlon 1.3GHz, 256MB RAM.

Of course, the performance of any query against less than 20 rows of data
is completely irrelevant for any real world need. What counts is the
performance of a query against millions of rows of data.

Groetjes, Hugo
--

Sorry, no more funny sig lines today.

Topmind

May 16, 2004, 11:36:02 PM
> > Tony wrote: If your analogy holds any water at all (to give you the
> > benefit of very large doubt), it suggests that relational theory will do
> > just fine for pretty much anything we ever want to do "in the real world".

Relational may not be Turing Complete, but it does not have to be.
Nor is it meant to be a "total solution". It is a very helpful
tool that complements application code (and may even help
organize it).

Relational may not do everything well, just an awful lot well.
I do agree that it would be nice if relational implementations
had more hierarchical operators, but in practice most
classification systems are not really trees once they grow
beyond the trivial. Trees have some nasty limitations, yet some
people keep seeing them as the Ultimate Structure.
The real world is not tree-shaped for the most part.
Philosophers have known this for a hundred-plus years.
Tree-like elements pop up, but there are usually enough
exceptions to make a pure tree impractical. It degenerates
into a graph, and relational many-to-many tables are pretty
good at that.

>
> $1000 to the first person who replicates the equivalent of
> www.xdb1.com/Example/Ex076.asp using the relational model. To claim
> the prize, one needs to produce the equivalent Nearest Common Ancestor
> Report from normalized and NULL-less data and the solution must be as
> generic, meaning allow the user to create any hierarchy, consisting of
> different types of things (each type to allow different attributes)
> and each thing in the hierarchy to have any number of parents. Report
> generation must not be more than 2X slower than XDb1 on equivalent
> hardware.

-T-


--CELKO--

May 17, 2004, 1:35:58 PM
>> $1000 to the first person who replicates the equivalent of
www.xdb1.com/Example/Ex076.asp using the relational model. To claim
the prize, one needs to produce the equivalent Nearest Common Ancestor
Report from normalized and NULL-less data and the solution must be as
generic, meaning allow the user to create any hierarchy, consisting of
different types of things (each type to allow different attributes)
and each thing in the hierarchy to have any number of parents. Report
generation must not be more than 2X slower than XDb1 on equivalent
hardware. <<

Here is the link on Amazon.com for my new book on "Trees & Hierarchies
in SQL":

http://www.amazon.com/exec/obidos/tg/detail/-/1558609202/qid=1080772873/sr=1-1/ref=sr_1_1/102-7683601-6345721?v=glance&s=books#product-details

Another way of representing trees is to show them as nested sets.
Since SQL is a set oriented language, this is a better model than the
usual adjacency list approach you see in most text books. Let us
define a simple OrgChart table like this.

CREATE TABLE OrgChart
(emp CHAR(10) NOT NULL PRIMARY KEY,
 lft INTEGER NOT NULL UNIQUE CHECK (lft > 0),
 rgt INTEGER NOT NULL UNIQUE CHECK (rgt > 1),
 CONSTRAINT order_okay CHECK (lft < rgt));

I can add some more constraints to handle overlap, etc. in full
SQL-92, but let me skip those for this newsgroup posting.

The organizational chart would look like this as a directed graph:

                Albert (1,12)
               /             \
              /               \
        Bert (2,3)        Chuck (4,11)
                         /     |      \
                        /      |       \
                       /       |        \
                      /        |         \
               Donna (5,6) Eddie (7,8) Fred (9,10)

OrgChart
emp       lft  rgt
===================
'Albert'    1   12
'Bert'      2    3
'Chuck'     4   11
'Donna'     5    6
'Eddie'     7    8
'Fred'      9   10

The nodes will be in a separate table, referenced from the tree
structure by a FOREIGN KEY having the DRI actions needed for the
particular business rules.

This model has some predictable results that we can use for building
queries. The root is always (left = 1, right = 2 * (SELECT COUNT(*)
FROM TreeTable)); leaf nodes always have (left + 1 = right); subtrees
are defined by the BETWEEN predicate; etc. Here are two common queries
which can be used to build others:

1. An employee and all their Supervisors, no matter how deep the tree.

SELECT O2.*
FROM OrgChart AS O1, OrgChart AS O2
WHERE O1.lft BETWEEN O2.lft AND O2.rgt
AND O1.emp = :myemployee;

2. The employee and all subordinates. There is a nice symmetry here.

SELECT O1.*
FROM OrgChart AS O1, OrgChart AS O2
WHERE O1.lft BETWEEN O2.lft AND O2.rgt
AND O2.emp = :myemployee;

3. Add a GROUP BY and aggregate functions to these basic queries and
you have hierarchical reports. For example, the total salaries which
each employee controls:

SELECT O2.emp, SUM(S1.salary)
FROM OrgChart AS O1, OrgChart AS O2,
Salaries AS S1
WHERE O1.lft BETWEEN O2.lft AND O2.rgt
AND O1.emp = S1.emp
GROUP BY O2.emp;

4. To find the level of each emp, so you can print the tree as an
indented listing. Technically, you should declare a cursor to go with
the ORDER BY clause.

SELECT COUNT(O2.emp) AS indentation, O1.emp
FROM OrgChart AS O1, OrgChart AS O2
WHERE O1.lft BETWEEN O2.lft AND O2.rgt
GROUP BY O1.lft, O1.emp
ORDER BY O1.lft;

5. To convert a nested sets model into an adjacency list model:

SELECT B.emp AS boss, E.emp
FROM OrgChart AS E
LEFT OUTER JOIN
OrgChart AS B
ON B.lft
= (SELECT MAX(lft)
FROM OrgChart AS S
WHERE E.lft > S.lft
AND E.lft < S.rgt);


6. Given two employees, find all their common supervisors:

SELECT DISTINCT S1.emp
FROM OrgChart AS E1,
OrgChart AS E2,
OrgChart AS S1
WHERE E1.emp = :my_emp_1
AND E2.emp = :my_emp_2
AND E1.lft BETWEEN S1.lft AND S1.rgt
AND E2.lft BETWEEN S1.lft AND S1.rgt;

This is just a version of the usual nesting predicates (1) and (2).

7. Given two employees, find their nearest common supervisor,

WITH S0 (emp, gap)
AS (SELECT S1.emp, (S1.rgt - S1.lft)
      FROM OrgChart AS E1,
           OrgChart AS E2,
           OrgChart AS S1
     WHERE E1.emp = :my_emp_1
       AND E2.emp = :my_emp_2
       AND E1.lft BETWEEN S1.lft AND S1.rgt
       AND E2.lft BETWEEN S1.lft AND S1.rgt)
SELECT S0.emp
  FROM S0
 WHERE S0.gap
     = (SELECT MIN(gap) FROM S0);

The WITH clause is part of SQL-99; you could replace it with a VIEW or
derived tables. The gap is the "diameter" of the containing set, and we
are looking for the smallest such set. Here is the solution in pure
SQL-92:

SELECT S0.emp
FROM (SELECT S1.emp, (S1.rgt - S1.lft)
FROM OrgChart AS E1,
OrgChart AS E2,
OrgChart AS S1
WHERE E1.emp = @my_emp_1
AND E2.emp = @my_emp_2
AND E1.lft BETWEEN S1.lft AND S1.rgt
AND E2.lft BETWEEN S1.lft AND S1.rgt) AS S0(emp, gap)
WHERE S0.gap
=(SELECT MIN (S1.rgt - S1.lft)
FROM OrgChart AS E1,
OrgChart AS E2,
OrgChart AS S1
WHERE E1.emp = @my_emp_1
AND E2.emp = @my_emp_2
AND E1.lft BETWEEN S1.lft AND S1.rgt
AND E2.lft BETWEEN S1.lft AND S1.rgt);

What was not given in the specs is how to handle two identical
employees; my solution is to say that they are their own nearest
supervisor.
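Celko's nested-set queries are easy to exercise end to end. A minimal
sketch of my own (SQLite stands in for the unspecified SQL-92 DBMS),
loading his OrgChart data and running the nearest-common-supervisor
query; I substitute ORDER BY/LIMIT for his MIN subquery, which is the
same "smallest containing diameter" idea in one pass:

```python
# Sketch: Celko's nested-set OrgChart, with his "nearest common
# supervisor" query. The containing sets of both employees are ordered
# by diameter (rgt - lft); the tightest one is the nearest supervisor.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table OrgChart
                (emp text primary key,
                 lft integer unique check (lft > 0),
                 rgt integer unique check (rgt > 1),
                 check (lft < rgt))""")
conn.executemany("insert into OrgChart values (?,?,?)",
                 [('Albert', 1, 12), ('Bert', 2, 3), ('Chuck', 4, 11),
                  ('Donna', 5, 6), ('Eddie', 7, 8), ('Fred', 9, 10)])

def nearest_common_supervisor(e1, e2):
    row = conn.execute("""
        select S1.emp
          from OrgChart as E1, OrgChart as E2, OrgChart as S1
         where E1.emp = ? and E2.emp = ?
           and E1.lft between S1.lft and S1.rgt
           and E2.lft between S1.lft and S1.rgt
         order by S1.rgt - S1.lft
         limit 1""", (e1, e2)).fetchone()
    return row[0]
```

Passing the same employee twice returns that employee, which matches
Celko's convention that identical employees are their own nearest
supervisor.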

Dawn M. Wolthuis

May 17, 2004, 4:20:26 PM
"Topmind" <top...@technologist.com> wrote in message
news:4e705869.04051...@posting.google.com...

> > > Tony wrote: If your analogy holds any water at all (to give you the
> > > benefit of very large doubt), it suggests that relational theory will do
> > > just fine for pretty much anything we ever want to do "in the real world".
>
> Relational may not be Turing Complete, but it does not have to be.
> Nor is it meant to be a "total solution". It is a very helpful
> tool that complements application code (and may even help
> organize it).
>
> Relational may not do everything well, just an awful lot well.
> I do agree that it would be nice
> if relational implementations had more
> hierarchical operators, but in practice most classification
> systems are not really trees when they grow beyond the
> non-trivial. Trees have some nasty limitations, yet some
> people keep seeing them as a the Ultimate Structure.
> The real world is not tree-shaped for the most part.
> Philosophers have known this for a hundred+ years.
> Tree-like elements pop up, but there are usually enough
> exceptions to make a pure tree impracticle. It degenerates
> into a graph, and relational many-to-many tables are pretty
> good at that.

Just to get terminology clearer -- a tree is a graph.

I don't have a large background in graph theory, so others can chime in,
but some graphs have cycles and some don't; some have direction, some are
strict trees. For every graph there are some corresponding trees that can
be useful in navigating the graph. The web is an implementation of a
directed graph, for example, while our road system is an implementation
of a graph, but not a directed one (or you can think of directions being
assigned to each street, most of those directions being both ways).
Implementations of graphs are all around us. There is nothing "ultimate"
about them, but they do provide a reasonably straightforward way to model
propositions, for example.

<snip>
--dawn


Neo

May 17, 2004, 6:12:48 PM
> You ask for the quoted example to be replicated "using the
> relational model". A model is not something that can be used to produce
> reports from data. I assume you meant to write "using an RDBMS"?
> Many people in the comp.databases.theory group claim that MS SQL Server is
> not really relational, but as I don't have access to a "real" relational
> database, I hope it'll do.

Yes, I meant a reasonably close implementation of the relational model.
And yes, MS SQL Server (or MS Access) is reasonable (to me).

Neo

May 17, 2004, 8:52:49 PM
> Furthermore, you seem to desire the possibility to enter untyped data,
> which is of course impossible in a strong-typed language. I do present
> "sort of" a way to do this in a relational database, but I'd never use a
> kludge even remotely like this for real. Just as I consider XDb1 to be
> completely worthless for any real problem, for exactly this same reason.
> Remove types, and nothing prevents your user from entering "banana" as
> John's age.

All things in XDb1 are typed/classified.
In XDb1, thing is the most general class.
Person's class is thing (person isa thing).
John's class is person (john isa person).
Mary's class is person (mary isa person).
Color's class is thing (color isa thing).
Red's class is color (red isa color).
Dog's class is thing (dog isa thing).
Fido's class is dog (fido isa dog). Etc...
Except for thing (which is the root),
can you name any thing in an XDb1 database that isn't classified?

In RM, classification can be accomplished similarly. By adding a row
in a table named T_Person, one effectively classifies that row as a
person. I think you may be confusing or limiting typing to hardware
types (ie bit, byte, integer, etc). XDb1's data model doesn't
require the hardware to have bit, byte or integer and is implemented as
such.

Any thing in XDb1 can have multiple classifications. For example, we
can further classify John as a doctor in addition to being a person
(john isa doctor). Also if user provides 35, XDb1/application can
classify it as both an integer and age. If user provides 35.1,
XDb1/application can classify it as both a decimal and age. If user
provides 35 & 1/3, XDb1/application can classify it as both a fraction
and age. If user provides thirty-five, XDb1/application can classify
it as both a word and age. If user provides "over-the-hill",
XDb1/application can classify it as both an expression and age.

You are correct in that XDb1 does not automatically validate "basic"
classes such as bit, byte and integer. In XDb1, bits, bytes, integer
are currently classifications whose rules need to be implemented by
the user. In the future they (along with other common classifications
such as color and person) might be provided.

Suppose in the future, a user creates type X. X has nothing to do with
hardware bits, bytes or integers and is not built into the db. What
will validate that x1 is an X in RM? It will be application logic, just
as it is in XDb1 (even for bit, byte and integer, since XDb1 doesn't
require them).

Neo

May 17, 2004, 11:52:42 PM
> Remove types, and nothing prevents your user from entering "banana" as
> John's age.

It is true that RM implementations provide checking for basic types
such as integer. Thus if one tries to enter "banana" for age, it
couldn't be done (assuming the column is typed as integer). XDb1 does
not provide the usual "basic" types/classes, except those needed to
boot the db, which are thing, relator, symbol and name. Currently the
"basic" types/classes have to be created and checked by the user's
code.

Now consider the type/class color. It is not a built in type. How can
RM implementations prevent the user from entering "banana" (the fruit)
for color?

In most RM implementations, "basic" types are inextricably related to
hardware. In XDb1, no type/class (except maybe for the thing class) is
implemented in hardware. Instead, they are abstracted in the db.

Neo

May 18, 2004, 12:01:58 AM
> If ....

> age isa thing.
> 80 isa age.
> john is 80.
>
> weight isa thing.
> 80 isa weight.
> mary is 80.
>
> Since 80 is both an age and a weight, how can XDb1 make sense of this?

In the cases where a name uniquely identifies a thing (ie "35 isa age"
and "130 isa weight"), the following simple format can be used:
john is 35.
mary is 130.

When a name identifies more than one thing (ie "80 isa age" and "80
isa weight"), the above statements will cause XDb1 to prompt the user
with "80 is ambiguous". The user then has two options: a) use the GUI
to point and click the correct 80, b) use the following unambiguous
format:
john's age is 80.
mary's weight is 80.

Note that 80 is not both age and weight.
80 is the name of a weight.
80 is also the name of an age.
The name of a thing and the thing the name identifies are two
different things.

The above statements create the RM equivalent of the following:
(Note: ->XYZ represents appropriate ID)

T_Person
ID  Name    Age       Weight
--  ------  --------  -----------
.   ->John  ->Age/80
.   ->Mary            ->Weight/80

T_Age
ID  Name
--  ----
.   ->80

T_Weight
ID  Name
--  ----
.   ->80

T_Name (This "table" pre-exists in XDb1, but is not pre-populated)
ID  Sym1  Sym2  Sym3  Sym4 ...
--  ----  ----  ----  ----
.   ->J   ->o   ->h   ->n
.   ->M   ->a   ->r   ->y
.   ->8   ->0

T_Symbol (This "table" pre-exists in XDb1 and is populated)
ID Sym
-- ---
. 0
. ...
. 9
. a
. ...
. z
. A
. ...
. Z

Data in XDb1 is normalized down to atomic symbols.

> I'm not sure if I understand all requirements. You demand you have to
> start from normalized data, but you fail to specify what normal form you
> want: 1NF, 2NF, 3NF, etc.

Replace duplicate things with a reference to the original using whatever
mechanism your environment provides (keys).

Neo

May 18, 2004, 2:08:50 AM
> There's one difference in the output: the "things" fido and
> laptop1 have two common ancestors at the same distance (john and mary),
> but Xdb1 is only able to find john. I consider this to be a bug in Xdb1's
> output, since john is definitely NOT a "nearer" common ancestor than mary.

XDb1's code is designed to print any one of the nearest common
ancestors. If your method prints all the nearest common ancestors, that
is fine, if not better.

Neo

May 18, 2004, 3:23:06 AM
> Note: the output shows the same starting and ending date/time. This
> indicates an elapsed time of less than 3 milliseconds (the smallest unit
> of time CURRENT_TIMESTAMP can measure). System used: PC running Windows
> 2000 + MS SQL Server 2000, AMD Athlon 1.3GHz, 256MB RAM.

Among other things, the normalization of the two implementations (still
looking it over) is quite different. Based on a cursory look, it seems
your implementation simply utilizes the hierarchy table and doesn't have
to resolve any keys to generate the report (which was not the desired
intent). In an initial attempt to make them more comparable, I modified
XDb1's algorithm so as to not resolve things' names and simply print
their IDs. Based on a few runs, the report generation time is 2 or 3 ms
using the hi-resolution QueryPerformanceCounter function provided by NT
on a 500 MHz PC. In general, the execution time is linear in the PC's
speed, thus 2 or 3 x (500/1,300) should be approx 0.77 to 1.2 ms on a
1.3GHz PC. These numbers are too small to make reasonable comparisons.

Hugo Kornelis

May 18, 2004, 5:27:12 AM
On 17 May 2004 17:52:49 -0700, Neo wrote:

>> Furthermore, you seem to desire the possibility to enter untyped data,
>> which is of course impossible in a strong-typed language. I do present
>> "sort of" a way to do this in a relational database, but I'd never use a
>> kludge even remotely like this for real. Just as I consider XDb1 to be
>> completely worthless for any real problem, for exactly this same reason.
>> Remove types, and nothing prevents your user from entering "banana" as
>> John's age.
>
>All things in XDb1 are typed/classified.
>In XDb1, thing is the most general class.
>Person's class is thing (person isa thing).
>John's class is person (john isa person).
>Mary's class is person (mary isa person).
>Color's class is thing (color isa thing).
>Red's class is color (red isa color).
>Dog's class is thing (dog isa thing).
>Fido's class is dog (fido isa dog). Etc...
>Except for thing (which is the root),
>can you name any thing in an XDb1 database that isn't classified?

Hi Neo,

Probably not. I expect you to know XDb1 lots better than I do, so I'll
take your word for it. But I didn't use the word "classified", I used the
word "untyped". You changed that to "typed/classified" at the beginning of
your reply, then conveniently stripped off the "typed" part in the
rhetorical question at the end of this quote.

Your explanation does raise another question. It looks as if the same
syntax is used to specify both intension and extension of the model,
thereby eliminating the distinction between schema and population. My
guess is that XDb1 would accept the following without complaining:

person isa thing.
john isa person.
mary isa person.
neo isa john.

XDb1 will probably gladly store it - but what does it mean?


>In RM, classification can be accomplished similarly. By adding a row
>in a table named T_Person, one effectively classifies that row as a
>person. I think you may be confusing or limiting typing to hardware
>types (ie bit, byte and integer, etc). XDb1's data model doesn't
>require hardware to have bit, byte or integer and is implemented as
>such.

No, I'm not confusing or limiting anything. I didn't mean hardware types
(you should know that the RM is hardware independent). I did mean
datatypes. There might be situations where an untyped database can have
its use (I admit that my original choice of words was too harsh), but
outside of those niches, the schema should be strictly separate from the
data and the datatype of the data should be known.


>Any thing in XDb1 can have multiple classifications. For example, we
>can further classify John as a doctor in addition to being a person
>(john isa doctor). Also if user provides 35, XDb1/application can
>classify it as both an integer and age. If user provides 35.1,
>XDb1/application can classify it as both a decimal and age. If user
>provides 35 & 1/3, XDb1/application can classify it as both a fraction
>and age. If user provides thirty-five, XDb1/application can classify
>it as both a word and age. If user provides "over-the-hill",
>XDb1/application can classify it as both an expression and age.

And this is exactly the reason why I'd never use XDb1 for serious work,
unless I encounter a problem area where the advantages of allowing untyped
data outweigh the disadvantages.

In 99.9% of all applications that store a person's age, comparisons have
to be made: who is younger than 45 years? Who is older, John, Mary or
Fido? How can XDb1 (or any other type-less database) answer the last
question if the user has provided the following input:

over-the-hill isa age.
very-young isa age.
7 isa age.
john is over-the-hill.
mary is very-young.
fido is 7.

Most computers I have used will classify very-young as greater than
over-the-hill and will refuse to compare either of these to 7.

Like I said - there may be specific situations where a product such as
XDb1 has its use. But it's not (to quote the web site) "the future of
databases" - not even remotely!


>You are correct in that XDb1 does not automatically validate "basic"
>classes such as bit, byte and integer. In XDb1, bits, bytes, integer
>are currently classifications whose rules need to be implemented by
>the user. In the future they (along with other common classifications
>such as color and person) might be provided.
>
>Suppose in the future, user creates type X. X has nothing to do with
>hardware bits, bytes or integers and is not built into the db. What
>will validate that x1 is a X in RM? It will be application logic just
>as it is in XDb1 (even for bit, byte and integer since XDb1 doesn't
>require them).

(from another message)


>Now consider the type/class color. It is not a built in type. How can
>RM implementations prevent the user from entering "banana" (the fruit)
>for color?

My comment is not about hardware types. Nor is it about domain checking
(which is what your "color banana" example illustrates). It is about data
types. Defining a column Color as character [varying] does not ensure that
all data entered will be colors, but it does ensure that they are in the
character domain. Likewise, defining the Age column as integer ensures
that all values will be in the numerical domain, even though it is still
possible (if I don't take steps to prevent it) to enter -41 or
3,765,987 in the Age column.

Restricting the data type is not enough to ensure that values adhere to
the required domain. But it does ensure that I can perform operations on
the data and rely on the outcome. Expressions like ValueA > ValueB, ValueA
+ ValueB and ValueA - ValueB are defined for values in the numeric domain.
The first two are also defined for values in the character domain, though
the definition differs from the numeric version; the third expression type
is not defined for the character domain and would result in an error. I
know all this, and I can rely on it - if and only if the database ensures
that only numeric data will be accepted in the Age column and only
character data in the Color column.
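Hugo's distinction between data type and domain can be sketched in SQL. The table below is hypothetical (names and the age bounds are mine, not from either poster): the column's data type guarantees the numeric or character domain, and an extra CHECK constraint narrows it further.

```sql
-- Hypothetical table: the data type keeps Age numeric,
-- while a CHECK constraint narrows it to plausible values.
CREATE TABLE Person (
    Name  varchar(30) NOT NULL PRIMARY KEY,
    Color varchar(20) NOT NULL,           -- character domain only
    Age   int NOT NULL
          CHECK (Age BETWEEN 0 AND 150)   -- rejects -41 and 3,765,987
);

-- Accepted: both values lie in their declared domains.
INSERT INTO Person (Name, Color, Age) VALUES ('John', 'red', 35);

-- Also accepted by the data type, even though 'banana' is not a color:
-- only an additional domain constraint could catch that.
INSERT INTO Person (Name, Color, Age) VALUES ('Mary', 'banana', 30);

-- Rejected: 'banana' cannot be converted to the numeric domain of Age.
-- INSERT INTO Person (Name, Color, Age) VALUES ('Fido', 'brown', 'banana');
```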


(from yet another message)


>The above statements create the RM equivalent of the following:
>(Note: ->XYZ represents appropriate ID)

(snip)


>T_Name (This "table" pre exists in XDb1, but is not pre-populated)
>ID Symbo1 Sym2 Sym3 Sym4 .........
>-- ------ ---- ---- ----
>. ->J ->o ->h ->n
>. ->M ->a ->r ->y
>. ->8 ->0

I would not call this "table" relational. It violates 1NF. If I did have
the desire (which I don't have) to break down names and values into
individual letters and digits, I'd at least use a design without repeating
groups.

Hugo Kornelis

May 18, 2004, 5:50:23 AM
On 18 May 2004 00:23:06 -0700, Neo wrote:

Hi Neo,

>> Note: the output shows the same starting and ending date/time. This
>> indicates an elapsed time of less than 3 milliseconds (the smallest unit
>> of time CURRENT_TIMESTAMP can measure). System used: PC running Windows
>> 2000 + MS SQL Server 2000, AMD Athlon 1.3GHz, 256MB RAM.
>
>Among other things, the difference in normalization between the
>implementations (still looking it over) is quite different.

Of course it is. My implementation is in the Relational Model (or at least
as close to the RM as SQL Server gets). Your implementation is in XDb1,
which is not relational at all. Implementations of this same problem in a
multi-valued database, a hierarchical database and a network database
would also use a different structure, suited to the needs and
possibilities of those databases.

I don't demand that XDb1 should use a relational structure, you should not
demand that a RM solution uses XDb1's structure. If you want to compare,
do so at the level where it counts. Check that both solutions accept the
same input (they do) and check that both solutions produce the correct
report (they do, except that XDb1 prints only one, randomly chosen common
ancestor if several ancestors are at the same distance - but I won't hold
that against you).

> Based on a
>cursory look, it seems your implementation is simply utilizing the
>hierarchy table and not having to resolve any keys to generate the
>report (which was not the desired intent).

See above. It should not matter HOW my implementation generates the
report.

> In an initial attempt to
>make them more comparable, I modified XDb1 algorithm so as to not
>resolve things' names and simply prints their IDs.

Now you're changing the rules after the competition has started. Not a
sign of good sportsmanship, Neo!

But if it is now permitted to produce output with only meaningless IDs
instead of a human readable output, I can optimise my design as well. If I
add an integer ID column to the things table and change all referencing
columns in other tables to use that ID instead of the current character
key, my solution's performance will increase significantly. Integer
comparison is faster than string comparison, the row size will be smaller,
resulting in more rows per data page = less I/O, etc.
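The surrogate-key optimisation Hugo describes might look roughly like this (a sketch with assumed table and column names; the actual DDL from his solution is not reproduced in this excerpt):

```sql
-- Sketch: an integer surrogate key on things, referenced by other tables.
CREATE TABLE things (
    ID    int NOT NULL PRIMARY KEY,    -- small, fixed-size key
    thing varchar(20) NOT NULL UNIQUE  -- the name, stored only once
);

CREATE TABLE hierarchies (
    thingID      int NOT NULL REFERENCES things (ID),
    otherThingID int NOT NULL REFERENCES things (ID),
    hierarchyID  int NOT NULL REFERENCES things (ID)
);
```

Joins then compare 4-byte integers instead of strings, and the narrower rows fit more per data page, which is the performance gain he mentions.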

> Based on a few
>runs, the report generation time is 2 or 3 ms using the hi-resolution
>QueryPerformanceCounter function provided by NT on a 500 MHz PC. In
>general, the execution is linear to PC's speed thus 2 or 3 x
>(500/1,300) should be approx 0.77 to 1.2 ms on a 1.3GHz PC. These
>numbers are too small to make reasonable comparisons.

Yes, I remember making a similar remark myself. But it was not me who set
this challenge, that was you. :-)

So far, you failed to disprove my solution. Instead, you attempt to change
the rules and you claim that the size of the sample data (provided by
yourself!) is too small to make reasonable comparisons. Why do I get the
impression that you're trying to bail out of this?

Neo

May 18, 2004, 3:33:14 PM
> > > you seem to desire the possibility to enter untyped data
> > All things in XDb1 are typed/classified.
> But I didn't use the word "classified", I used the word "untyped".

In XDb1's data model type, class and domain are the same thing. Even
C.J. Date begins one of his chapters by saying "domain is essentially
a data type".

> > > Remove types, and nothing prevents your user from entering "banana" as
> > > John's age.

Besides providing automatic checking for "basic" types, there is a
fundamental difference between RM and XDb1's data model. RM sets rules
on what data can be accepted. XDb1 accepts the data given to it and
requires app logic to properly check/classify it.

Thus, in RM, if age has a data type of integer, it will accept 8 but
not "eight".

In XDb1, the app logic would need to determine the class of the data
being given, look at the class of existing ages and then make a
decision. For example, given "eight", it could still classify it
as an age, but the second classification would be word instead of
integer.

In RM, data can have domain(s?) and data type(s?).
In XDb1, data have class(es) and data type is simply an additional
classification.

Even given the concept of data types in RM, nothing prevents the user
from entering "banana" as a color (except for app logic).

Neo

May 18, 2004, 4:01:31 PM
> the schema should be strictly seperate from the data

In XDb1's data model there is no separation between data and schema at
the db level; however, the app logic can choose to make those
distinctions if needed.

> and the datatype of the data should be known.

In XDb1's data model, "data type" is simply an additional
classification of a thing. If it needs to be known, it should be
classified as such (ie. 35 isa integer).

> And this is exactly the reason why I'd never use XDb1 for serious work,
> unless I encounter a problem area where the advantages of allowing untyped
> data outweigh the disadvantages.

I think I finally understand what you are saying. In the example,
since 35 was only classified as an age, it appears to be "untyped".
The example was simple. 35 can also be classified as an integer. Each
thing in XDb1 can have multiple classes, one of which can function
similar to its "data type".

Neo

May 18, 2004, 4:18:40 PM
> My guess is that XDb1 would accept the following without complaining:
> person isa thing.
> john isa person.
> mary isa person.
> neo isa john.
>
> XDb1 will probably gladly store it - but what does it mean?

At the db level, it would mean exactly the same as:
t1 isa t.
t2 isa t1.
t3 isa t1.
t4 isa t2.

At the db level, it doesn't care what the app is trying to encode. It
is up to the app to ensure the data has relevance to something. It
could mean the following:

vehicle isa thing.
car isa vehicle.
bus isa vehicle.
corvette isa car.

If Neo is cloned from John, then even your example above could have
meaning.

Hugo Kornelis

May 18, 2004, 4:20:46 PM
On 18 May 2004 12:33:14 -0700, Neo wrote:

>> > > you seem to desire the possibility to enter untyped data
>> > All things in XDb1 are typed/classified.
>> But I didn't use the word "classified", I used the word "untyped".
>
>In XDb1's data model type, class and domain are the same thing. Even
>C.J. Date begins one of his chapters by saying "domain is essentially
>a data type".

Hi Neo,

Well, I'm not much of a theorist. Even though I'm reading and writing this
from the .theory group (I don't subscribe to the other three this is
posted in), I am more a hands-on kind of guy.

That's probably the reason why I came up with a working solution, whereas
the theorists constrain themselves to hitting you on the head with a large
club for calling SQL Server and Access "reasonably close implementations
of the relational model". (Actually, for the Access part I'm tempted to ask
them if I can borrow their club for a while :-> )

From my practical, hands-on real-world background, I treat types, classes
and domains as three different things. A class is a set of individual
items, events or notions from the real world that have enough attributes
in common to warrant treating them as individual items from one set. A
domain is the complete collection of allowable values for an attribute of
individual members of a class. I like to think of a datatype as a superset
of many domains. The domain for all integers is the same as the datatype
integer. The domain for attribute "age (in years)" of the class "employee"
is a subset of the integer domain.

(snip rest of your message - it's basically a repetition of an earlier
message and I can't find any point in it that I didn't already address in
a previous post)


Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Hugo Kornelis

May 18, 2004, 4:46:30 PM

Hi Neo,

Rereading what I wrote and what you replied, I may have chosen the wrong
word. (I'm Dutch - English is not my native language).

When I wrote "untyped", I meant that XDb1 performs no type checking. The
fact that you can classify 35 as an integer doesn't solve anything. I want
to be able to subtract John's age from Mary's age and be sure that the
result is their age difference, in years. That can only be achieved if the
database can be told that EACH age should be a whole number. That can be
done very easily in every RDBMS that I know. I have not yet seen how that
constraint can be built into XDb1.

Of course, you can always do it in the application. But for something I
need very often, I prefer to have support built into my tools.

Note that I expect that at least 99.9% of my future work will require the
datatype of a value to be known to the programmer. This is based on 20
years of experience (where the percentage has always been 100%); the 0.1%
insecurity margin is included because I know that the unimaginable is
usually just about to happen.

Note also that MS SQL Server, the RDBMS I do most work on at the moment,
supports a datatype "sql_variant" that will store basically everything. I
have never used this datatype, nor have I ever seen a question in any SQL
Server related newsgroup (and I participate in quite a few of them) where
this datatype would be the answer. And I expect I never will.

Neo

May 18, 2004, 4:52:07 PM
> Who is older, John, Mary or Fido?
> How can XDb1 (or any other type-less database) answer the last
> question if the user has provided the following input:

>
> over-the-hill isa age.
> very-young isa age.
> 7 isa age.
> john is over-the-hill.
> mary is very-young.
> fido is 7.

Again, the example provided earlier was simple. 7 can be further
classified as an integer (via 7 isa integer) thus allowing operations
appropriate to the integer class (aka "data type" in RM).

> Most computers I have used will classify very-young as greater than
> over-the-hill and will refuse to compare either of these to 7.
> Like I said - there may be specific situations where a product such as
> XDb1 has it's use. But it's not (to quote the web site) "the future of
> databases" - not even remotely!

If I include human brains as computers (which they are), many of them
will come to the correct answer. XDb1 is partially a feeble attempt to
model and implement a db like that between our ears.

Hugo Kornelis

May 18, 2004, 5:00:00 PM
On 18 May 2004 13:18:40 -0700, Neo wrote:

>> My guess is that XDb1 would accept the following without complaining:
>> person isa thing.
>> john isa person.
>> mary isa person.
>> neo isa john.
>>
>> XDb1 will probably gladly store it - but what does it mean?
>
>At the db level, it would mean exactly the same as:
> t1 isa t.
> t2 isa t1.
> t3 isa t1.
> t4 isa t2.
>
>At the db level, it doesn't care what the app is trying to encode. It
>is upto the app to ensure the data has relavence to something. It
>could mean the following:
>
> vehicle isa thing.
> car isa vehicle.
> bus isa vehicle.
> corvette isa car.

The English-like language you use in all XDb1 examples uses exactly the
same construction for modifying the meaning of the data ("vehicle isa
thing. / person isa thing." as opposed to "vehicle isa thing. / person isa
vehicle.") and for modifying the data itself ("john isa person.").

The choice to use one syntax for modifying both intension and extension,
as opposed to the clear separation of DML and DDL in SQL, gives me the
impression that end users of XDb1 databases will be allowed to use this
language to modify both the intension and the extension at will. While
that may be acceptable in some situations, it usually isn't.

OTOH, you sometimes write that the app should check things, like in the
above quote. That can only be done if the English-like language is not
made available to end users. But in that case, why would you bother to use
that language in the first place?


>If Neo is a cloned from John, then even your example above could have
>meaning.

:-)

Hugo Kornelis

May 18, 2004, 5:22:34 PM

Hi Neo,

Just caught this message. I think I addressed most points in my previous
posts. Except for the human brain part.

This places XDb1 in a completely different league from current RDBMS's, in
my eyes. If XDb1's main objective is emulating the human brain, I have to
consider it a very interesting research project, probably worthy of loads
of government and university funding.

Until now, based on your postings and the information about XDb1 on the
web site, I was under the impression that XDb1 was meant as a serious
alternative for current databases. Clearly, I was wrong.

But I still fail to see how the $1000 challenge that started this
discussion fits into this picture.....


Seriously now: if you one day really succeed in making a program that will
gladly accept data such as "very-young" and "over-the-hill" and still come
to correct conclusions on only that data, *then* you can truly say that you
hold the future of databases in your hands. As well as the future of AI.
But until that day has come, you'd better learn a little modesty.

Neo

May 18, 2004, 7:47:48 PM
> > The ... statements creates the RM equivalent of the following:
> > T_Name

> > ID Symbo1 Sym2 Sym3 Sym4 .........
> > -- ------ ---- ---- ----
> > . ->J ->o ->h ->n
> > . ->M ->a ->r ->y
> > . ->8 ->0
>
> I would not call this "table" relational. It violates 1NF.

The above description of table T_Name was a simplification. Each thing
in XDb1 can have any number of names including none. Each name (ie
"John Smith", "Mary Smith") can be composed of any number of composite
symbols (ie "John", "Mary", "Smith"). Each composite symbol is
composed of any number of atomic symbols (ie 0-9, a-z, A-Z, etc). As
you correctly pointed out, the composite symbols might be modelled more
appropriately as follows:

T_CompSymbol
ID Name
-- -----
. ->Smith

T_CompSymbol_AtomicSymbol_Map
CSymID ASymID Order
------- ------ ----
->Smith ->S 1
->Smith ->m 2
->Smith ->i 3
->Smith ->t 4
->Smith ->h 5

T_AtomicSymbol
ID Sym
-- ---
. a
. b
. c

> If I did have the desire (which I don't have) to break down names
> and values into individual letters and digits, I'd at least use a
> design without repeating groups.

When a user enters "john smith isa person", XDb1 normalizes that data
down to atomic symbols, without "repeating groups".
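One possible DDL rendering of the three tables sketched above (the column types and key choices are my assumptions — XDb1 does not expose such a schema — and "Order" is renamed SymOrder because ORDER is a reserved word in SQL):

```sql
-- Hypothetical relational rendering of XDb1's symbol storage.
CREATE TABLE T_AtomicSymbol (
    ID  int NOT NULL PRIMARY KEY,
    Sym char(1) NOT NULL UNIQUE          -- 'a', 'b', 'c', ...
);

CREATE TABLE T_CompSymbol (
    ID int NOT NULL PRIMARY KEY          -- one row per word, e.g. 'Smith'
);

CREATE TABLE T_CompSymbol_AtomicSymbol_Map (
    CSymID   int NOT NULL REFERENCES T_CompSymbol (ID),
    ASymID   int NOT NULL REFERENCES T_AtomicSymbol (ID),
    SymOrder int NOT NULL,               -- position of the letter in the word
    PRIMARY KEY (CSymID, SymOrder)       -- one letter per position per word
);
```

With this structure each atomic symbol is stored once and words are rebuilt by joining the map table in SymOrder, which is the "no repeating groups" design Hugo asked for.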

Neo

May 19, 2004, 2:20:19 AM
> > Among other things, the difference in normalization between the
> > implementations (still looking it over) is quite different.
>
> I don't demand that XDb1 should use a relational structure, you should not
> demand that a RM solution uses XDb1's structure. If you want to compare,
> do so at the level where it counts.

The report your solution generates is correct, I have no disagreement
with that. But the implementation is generating it from unnormalized
data, whereas XDb1 generates it from data normalized down to atomic
symbols (ie a, b, c). Yes, I consider each symbol to be data. When
XDb1 prints the words 'john' and 'god' on the report it resolves down
to the one and only symbol 'o' in the db. The implementation provided
doesn't model the atomic symbols (ie a-z). As a starting point, the
word 'john' needs to be normalized as it appears multiple times in
multiple tables. Changing any one of them would corrupt the data. In
XDb1, there is only one instance of the word 'john'.

There is another issue that I will point out in another post.



> > In an initial attempt to make them more comparable,
> > I modified XDb1 algorithm so as to not
> > resolve things' names and simply prints their IDs.
>
> Now you're changing the rules after the competition has started. Not a
> sign of good sportsmanship, Neo!

I do not think so, data must be as normalized as in XDb1's db. In
general, there is no way XDb can compete with RM where a report is
being generated from simply one table.

Hugo Kornelis

May 19, 2004, 5:49:39 PM
On 18 May 2004 23:20:19 -0700, Neo wrote:

>> > Among other things, the difference in normalization between the
>> > implementations (still looking it over) is quite different.
>>
>> I don't demand that XDb1 should use a relational structure, you should not
>> demand that a RM solution uses XDb1's structure. If you want to compare,
>> do so at the level where it counts.
>
>The report your solution generates is correct, I have no disagreement
>with that.

Thanks for the acknowledgement.


> But the implementation is generating it from unnormalized
>data

Isn't Google just great? Here are Date's formal definitions for the five
(actually six, BCNF never got its own number) normal forms, copied and
pasted from http://www.databasedesign-resource.com/normal-forms.html

(start quote)

First Normal Form

‘A relation R is in first normal form (1NF) if and only if all underlying
domains contain atomic values only.’

Second Normal Form

‘A relation R is in second normal form (2NF) if and only if it is in 1NF
and every nonkey attribute is fully dependent on the primary key.’

Third Normal Form

‘A relation R is in third normal form (3NF) if and only if it is in 2NF
and every nonkey attribute is nontransitively dependent on the primary
key.’

Boyce/Codd Normal Form

‘A relation R is in Boyce/Codd normal form (BCNF) if and only if every
determinant is a candidate key.’

Fourth Normal Form

‘A relation R is in fourth normal form (4NF) if and only if, wherever
there exists an MVD in R, say A -> -> B, then all attributes of R are also
functionally dependent on A. In other words, the only dependencies (FDs or
MVDs) in R are of the form K -> X (i.e. a functional dependency from a
candidate key K to some other attribute X). Equivalently: R is in 4NF if
it is in BCNF and all MVD’s in R are in fact FDs.’

Fifth Normal Form

‘A relation R is in fifth normal form (5NF) – also called projection-join
normal form (PJ/NF) if and only if every join dependency in R is a
consequence of the candidate keys of R.’

For every normal form it is assumed that every occurrence of R can be
uniquely identified by a primary key using one ore more attributes in R.


FD = Functional Dependency
MVD = Multi-Valued Dependency

(end quote)

And on another website (http://nunic.nu.edu/~ckettemb/DBNorm.html):

(start quote)

Domain Key Normal Form (DKNF)

We can also always define stricter forms that take into account additional
types of dependencies and constraints. The idea behind domain-key normal
form is to specify, (theoretically, at least) the "ultimate normal form"
that takes into account all possible dependencies and constraints. A
relation is said to be in DKNF if all constraints and dependencies that
should hold on the relation can be enforced simply by enforcing the domain
constraints and the key constraints specified on the relation.

For a relation in DKNF, it becomes very straightforward to enforce the
constraints by simply checking that each attribute value in a tuple is of
the appropriate domain and that every key constraint on the relation is
enforced. However, it seems unlikely that complex constraints can be
included in a DKNF relation; hence, its practical utility is limited.

(end quote)

If you really want to maintain that my implementation generated it's
report from unnormalized data, please tell me which Normal Form you think
I violated and why.


> where as XDb1 generates it from data normalized down to atomic
>symbols (ie a, b, c). Yes, I consider each symbol to be data. When
>XDb1 prints the words 'john' and 'god' on the report it resolves down
>to the one and only symbol 'o' in the db.

But a is not an atomic symbol at all! Most computers store an a as a
collection of 8 bits: 01100001 (97, the ASCII value of 'a').

I fail to see why you would break down 'John' into J, o, h and n, but
since that's obviously what XDb1 does, I also fail to see why the breaking
down should stop at this (arbitrarily chosen) level. Why letters? Why not
bits? nibbles? syllables?


> The implementation provided
>doesn't model the atomic symbols (ie a-z).

And neither did XDb1's model. You're confusing implementation internals
with the model. The RDBMS model had phrases such as "insert things (thing)
values ('john')" / "insert types (thing, type) values ('john', 'person')".
The XDb1 model had phrases such as "john isa person".

Though XDb1's syntax is shorter and more English-like, it doesn't drill
down to the level of letters or symbols. I don't see any 'j' on it's own,
I only see the symbols 'john' and 'person' as a whole.

How XDb1 stores these symbols internally is interesting, but totally
irrelevant to the model. Same goes for the RDBMS. I don't know how SQL
Server stores 'john', and frankly I don't want to know. Maybe it
calculates a 400-byte bit pattern from the individual letters, encrypts
that via 192-bit Rijndael encryption, swaps some bits and calculates a
checksum on top of that. Maybe it uses exactly the same way to store its
data as XDb1. Maybe it even changes the way it stores data on a daily
basis, automatically converting all data at midnight. I don't know and I
couldn't care less, as long as "select thing from things" returns "john"
and not "jhon".


> As a starting point, the
>word 'john' needs to be normalized as it appears multiple times in
>multiple tables.

And how does that violate normalization rules? If I changed the things
table to include an ID and a name, then changed all other tables to use
the ID instead of the name to refer to a thing, I'd still have the symbol
representing john multiple times in multiple tables. The fact that this
symbol now is the number 3 instead of the character string 'john' doesn't
change anything.


> Changing any one of them would corrupt the data.

No. Changing any of them would result in a foreign key violation and the
transaction would be rolled back by SQL Server. You didn't miss the
"references" clauses in my DDL, did you?
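The rollback Hugo refers to can be illustrated with a minimal sketch (a hypothetical two-table fragment, not the actual DDL from his solution):

```sql
-- Minimal sketch: a REFERENCES clause makes the engine reject any
-- change that would orphan a referencing row.
CREATE TABLE things (
    thing varchar(20) NOT NULL PRIMARY KEY
);

CREATE TABLE hierarchies (
    thing      varchar(20) NOT NULL REFERENCES things (thing),
    otherThing varchar(20) NOT NULL REFERENCES things (thing)
);

INSERT INTO things VALUES ('god');
INSERT INTO things VALUES ('army');
INSERT INTO hierarchies VALUES ('army', 'god');

-- Fails: the row in hierarchies still references 'god', so the engine
-- raises a foreign key violation and the transaction is rolled back.
-- UPDATE things SET thing = 'jhon' WHERE thing = 'god';
```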


> In
>XDb1, there is only one instance of the word 'john'.

And how many pointers (or similar structures that serve the same goal) to
that word?

>
>There is another issue that I will point out in another post.

It's now 15 hours since you posted this and I still don't see it. Could
you repost that message?


>
>> > In an initial attempt to make them more comparable,
>> > I modified XDb1 algorithm so as to not
>> > resolve things' names and simply prints their IDs.
>>
>> Now you're changing the rules after the competition has started. Not a
>> sign of good sportsmanship, Neo!
>
>I do not think so, data must be as normalized as in XDb1's db.

	"I challenge you to make a wooden chair with screws and a
screwdriver in at most twice the time it takes me to make a similar chair
with nails and a hammer".
"Done!"
"Hmm, nice chair indeed. And you even finished before me. But I
see that you inserted the screws by rotating the screwdriver. That was not
the intention. You should have hit the screws repeatedly on the head with
the blunt end of the screwdriver, like I hit the nails with my hammer".

The challenge was "to produce the equivalent Nearest Common Ancestor
Report from normalized and NULL-less data and the solution must be as
generic, meaning allow the user to create any hierarchy, consisting of
different types of things (each type to allow different attributes)
and each thing in the hierarchy to have any number of parents".

There was nothing in the original message or the webpage mentioned in it
about mimicking XDb1's internal data structure, nor about storing the
character 'o' only once. If that had been the challenge, I'd never have
bothered - I'm not in the habit of using a screwdriver for hammering
screws in wood.

I fulfilled all requirements you set for the challenge. I just checked all
groups this is crossposted to but saw no other entries. I hereby claim the
$1000 prize. Please transfer the money to my bank account: International
Bank Account Number (IBAN) NL59 RABO 0118 3365 68 (bank information:
BIC/SWIFT address RABONL2U).


> In
>general, there is no way XDb can compete with RM where a report is
>being generated from simply one table.

I fully agree with that. Even without the where-part.

Neo

May 19, 2004, 11:03:18 PM
> > In XDb1's data model type, class and domain are the same thing.
>
> I treat types, classes and domains as three different things.
>
> A class is a set of individual items, events
> or notions from the real world that have enough attributes
> in common to warrant treating them as individual items from one set.

Agreed. In XDb1, any thing in the db can become a class.
Thus t1 becomes a class of t2 when the user enters 't2 isa t1'.

> A domain is the complete collection of allowable values for an attribute of
> individual members of a class.

In RM, the attributes of a new record are stipulated by the table
design. In XDb1, a new thing can have any number of attributes
including none (user code has to stipulate otherwise). As you stated,
in RM, a record's attribute's value comes from a domain stipulated by
the table design. In XDb1, a thing can be related to any other thing
as a value. For example, t1 (john) can be related to t2 (35) as a
value via 'john is 35'. t2 can have any number of classes. If t2 is
35, it should have the classes age and integer. If t2 is
'thirty-five', it should have the classes age and word. In the
statement 't1 is t2', t2 can have any class. If user code limits t2
to a certain class, then it would be similar to a domain. Thus, I
misspoke earlier. Class and domain are related but not equal.

> a datatype as a superset of many domains.
> The domain for all integers is the same as the datatype
> integer. The domain for attribute "age (in years)" of the class "employee"
> is a subset of the integer domain.

Suppose car is the class of car1. Car1 has the attribute color. The
attribute's value can be from the domain containing red and green.
Would the data type be red, green and blue (assume these are all the
colors)? Or does the data type include more things, including '*&#xyzPN'?
Would a color domain declared as char(5) have a different data type
than when declared as char(6)? Is there a difference between "type"
and "data type"? It seems to me that "data type" has a dual meaning,
one of which is related to hardware storage specifics. In reality,
colors have nothing to do with char, integer or float. But when stored
in most dbs they could have those data types. Where are the theorists
:)

Neo

May 20, 2004, 12:56:15 AM
> > But the implementation is generating [report] from unnormalized data

>
> Here are Date's formal definitions for the five
> (actually six, BCNF never got it's own number) normal forms, copied and
> pasted from http://www.databasedesign-resource.com/normal-forms.html

Ok, so Date has defined 5 or 6 normal forms. And normal forms are the
result of normalization. In his book, Date says normalization is "the
successive reduction of a given collection of relations to some more
desirable form". The problem with such definitions and normal forms
is that they are RM specific. XDb1 does not have relation-like
structures and the normal forms don't apply. At the beginning of the
same chapter, Date gives a more general meaning of normalization which
relates to removing redundancy. In XDb1, normalization is the process
of replacing duplicate things with a reference to the original thing.
In the most general sense, normalization is the elimination of
redundancy. The data in your tables, like the one replicated below,
have redundancy; they are not normalized.

Table hierarchies
thing otherThing hierarchy
--------- ---------- ---------
'army' 'god' 'leader'
'trinity' 'god' 'leader'
'john' 'army' 'leader'
'mary' 'army' 'leader'
'mary' 'trinity' 'leader'
'luke' 'trinity' 'leader'
'laptop1' 'john' 'leader'
'laptop1' 'mary' 'leader'
'fido' 'john' 'leader'
'fido' 'mary' 'leader'
'fido' 'luke' 'leader'


Also consider what will happen if the first person's name is 'john'
and the dog's name is also 'john'.

Neo

May 20, 2004, 2:25:21 PM
> > Among other things, the difference in normalization between the
> > implementations (still looking it over) is quite different.
>
> I don't demand that XDb1 should use a relational structure,
> you should not demand that a RM solution uses XDb1's structure.

And I don't demand that a RM solution use XDb1's structure, only that
it provide a solution that normalizes the data to a similar extent.
For example, if the user names the first person 'john' and the dog
also 'john', your current solution won't work.

In XDb1, the solution will still continue to work, although it should
also print each thing's class (ie "person/john" and "dog/john") so
that user can distinguish them on the report. In XDb1, if the user
names the first person 'john', the second person 'mary' and the dog
'mary john', there would still be only one occurrence of the word
'john' and one of the word 'mary' in the db ('mary john' would be
composed from references to 'mary' and 'john').

Neo

May 20, 2004, 3:04:33 PM
> > Among other things, the difference in normalization between the
> > implementations is quite different.

>
> It should not matter HOW my implementation generates the report.

It does matter because a less generic and unnormalized implementation
will cause more problems or even fail under a broader scope. For
instance when the names of two things are the same.

Another issue with the provided implementation is that the class
hierarchy is not modelled faithfully. A part of the original hierarchy
is shown below:

thing
  god
  person
    john
    mary
  dog
    fido ...

In the provided implementation, the class hierarchy is (flat):

thing
  god
  person
  john
  mary
  dog
  fido ...

Thus, if a person is named 'skippy' and a dog is also named 'skippy',
and we print a thing's class to distinguish them on the report, XDb1
would print:
person/skippy
dog/skippy

whereas the provided implementation would print:
thing/skippy
thing/skippy

Neo

May 20, 2004, 5:22:24 PM
> > > This indicates an elapsed time of less than 3 ms ...

> >
> > Among other things, the difference in normalization between the
> > implementations is quite different.
> > In an initial attempt to make them more comparable...

>
> Now you're changing the rules after the competition has started.
> Not a sign of good sportsmanship, Neo!

The provided implementation was not similarly generic or normalized.
To make a very rough comparison, I simplified my implementation to
that which was provided. In general, XDb1 can't compete with RM in the
scope of a single table's worth of data since XDb1 has the overhead of
variable storage structures for each table cell. And yes, IDs are
faster than text so the comparison is far from perfect. XDb1 should
gain an advantage as more tables are joined. The provided solution
doesn't join any tables since it is less generic and unnormalized.

As I said earlier, in the simplified case, it takes XDb1 2 or 3 ms on a
500 Mhz, 512MB Dell PowerEdge Server. I ran the provided SQL Server
script on the same machine and the report generation takes 65 ms
(based on the avg of the 3 runs below):

Starting time Ending time Time Elapsed
------------- ------------ ------------
14:44:57.670 14:44:57.733 67 ms
15:02:26.233 15:02:26.297 64 ms
15:07:57.780 15:07:57.843 63 ms

What might I be doing wrong to get results different than yours?

Hugo Kornelis

unread,
May 20, 2004, 5:38:58 PM5/20/04
to
On 19 May 2004 20:03:18 -0700, Neo wrote:

(snip class/domain stuff)


>> a datatype as a superset of many domains.
>> The domain for all integers is the same as the datatype
>> integer. The domain for attribute "age (in years)" of the class "employee"
>> is a subset of the integer domain.
>
>Suppose, car is the class of car1. Car1 has the attribute color. The
>attribute's value can be from the domain containing red and green.
>Would data type be red, green and blue (assume these are all the
>colors)?

No. The data type would be character. And the domain would be {red, green}
(I have no idea why blue should be added to the domain if cars can only be
red or green).


> Or does data type include more things including '*&#xyzPN'?

Yes. As I said: the data type would be character. The restriction to
either 'red' or 'green' is in the domain.

>Would a color domain declared as char(5) have a different data type
>than when declared as char(6)?

I consider them both to be the character domain, but the RDBMS will
obviously reserve one extra byte for storing the char(6) as opposed to the
char(5).


> Is there a difference between "type"
>and "data type"?

Without context, "type" can refer to anything: data type, entity type,
relationship type, car type - heck, it could even be a verb!

The context dictates what type is meant if the word "type" is used on its
own. If the context indicates that "type" refers to "data type", then
there is obviously no difference.

> It seems to me that "data type" has a dual meaning,
>one of which is related to hardware storage specifics?

I'd use a slightly different formulation: data types are used to close the
gap between a domain and the representation of values by computers.

> In reality,
>colors have nothing to do with char, integer or float. But when stored
>in most dbs they could have those data types. Where are the theorists
>:)

Colors can't be stored in databases at all. The closest you'd get would
be if you poured the contents of a bucket of red paint through the floppy
disk slot of your computer - but then you'd be storing paint in your
computer, not a color.

The only thing that a database (or any other computer program) can store
is a *reference* to a color. One of the last steps of designing a model
for your data is deciding how this reference will look. Colors might be
referenced by a name (but do not forget to specify the language to use,
otherwise you'll have 'rot', 'vert' and 'blauw' in your table before you
can say RGB), but RGB values, industry-standard color numbers, an inhouse
color number or a pointer (invisible to the user, but nevertheless stored
in the computer) might be used just as well. Distances might also be
referenced by a descriptive name or a pointer, but they are commonly
referenced by a number; it remains important that everybody agrees on what
that number means: miles? kilometers? Space shuttles have blown up over
trivialities like these!

Neo

unread,
May 20, 2004, 6:07:05 PM5/20/04
to
> > Changing any one of them would corrupt the data.
>
> No. Changing any of them would result in a foreign key violation and the
> transaction would be rolled back by SQL Server. You didn't miss the
> "references" clauses in my DDL, did you?

You are correct. I could not change the name of any thing in the first
two columns of table hierarchies so I could not corrupt the data this
way; however, this design prevents two things from having the same
name, thus it is a non-generic solution.

I did manage to corrupt the data by changing one of the duplicate
'leader' to 'leader2' in the third column. You missed it here. In
XDb1, the relator 'leader' is normalized.

Hugo Kornelis

unread,
May 20, 2004, 6:33:51 PM5/20/04
to
On 19 May 2004 21:56:15 -0700, Neo wrote:

>> > But the implemenation is generating [report] from unnormalized data
>>
>> Here are Date's formal definitions for the five
>> (actually six, BCNF never got its own number) normal forms, copied and
>> pasted from http://www.databasedesign-resource.com/normal-forms.html
>
>Ok, so Date has defined 5 or 6 normal forms. And normal forms are the
>result of normalization. In his book, Date says normalization is "the
>successive reduction of a given collection of relations to some more
>desirable form". The problem with such definitions and normal forms
>is that they are RM specific.

Exactly how is that a "problem"?

In your challenge, you asked for an equivalent report "using the
relational model (...) from normalized and NULL-less data". It is *you*
who asked for normalized data using the relational model. If you know a
better source for normalizing data using the relational model then Date,
please name it, with relevant URLs.

You didn't set a challenge to make a report using a model that is
normalized according to XDb1's own definition of normalisation and then
kludged in an RDBMS. You wanted us to use the relational model. The normal
forms as defined by Date are part of that model, if you like it or not.


> XDb1 does not have relation-like
>structures and the normal forms don't apply.

So? I don't have a problem with that. Please, by all means, let XDb1 store
the data in every way it seems fit, normalized or not - I couldn't care
less. The challenge was not about XDb1, it was about making a report from
normalized data using the relational model. I certainly won't ask you to
use the relation model's way of normalizing data you store in XDb1. If you
want to use a hammer and nails, I'll never ask that you insert the nails
by rotating the hammer.


> At the beginning of the
>same chapter, Date gives a more general meaning of normalization which
>relates to removing redundancy.

My solution had no redundancy. I'll come back to that later.


> In XDb1, normalization is the process
>of replacing duplicate things with a reference to the original thing.

I don't care how XDb1 normalizes. The challenge was for using the
relational model, not for using XDb1's model.


>In the most general sense, normalization is the elimination of
>redundancy. The data in your tables, like the one replicated below,
>have redundancy, they are not normalized.
>
> Table hierarchies
>thing     otherThing hierarchy
>--------- ---------- ---------

(snip some rows)


>'fido'    'john'     'leader'
>'fido'    'mary'     'leader'
>'fido'    'luke'     'leader'

You might be surprised to hear this, but there's a reason that the
relational model is called "relational". It's about storing relations.

The relations stored in the non-snipped part of the above table can be
expressed as follows:
The thing referenced by the name 'fido' is just below the thing
referenced by the name 'john' in the hierarchy referenced by the name
'leader'.
The thing referenced by the name 'fido' is just below the thing
referenced by the name 'mary' in the hierarchy referenced by the name
'leader'.
The thing referenced by the name 'fido' is just below the thing
referenced by the name 'luke' in the hierarchy referenced by the name
'leader'.

I could write this out for the entire table, but my point should already
be clear: there is no redundancy. Each row in the table expresses one
relation between two instances of the type "thing" and one instance of the
type "hierarchy". Though one instance of the type "thing" may be referred
to in several of the above expressions, each expression as a whole is
different from each other expression. Nothing can be removed from the
above table without changing the meaning of the data in the database.

There would have been redundancy if each object's class was stored along
with its reference, like this:

thing     class     otherThing otherClass hierarchy
--------- --------- ---------- ---------- ---------
(snip some rows)
'fido'    'dog'     'john'     'person'   'leader'
'fido'    'dog'     'mary'     'person'   'leader'
'fido'    'dog'     'luke'     'person'   'leader'

This example of a bad design has redundant storage of the same relation.
The following appears three times in the non-snipped part:
The thing referenced by the name 'fido' is member of the class referenced
by the name 'dog'.

You seem to be mixing up two things: storing the same relation multiple
times (redundancy) vs storing the same reference to a thing multiple times
(not redundancy, since the same thing can participate in many relations
and needs to be referred to in all of them).

You seem to be thinking that storing the character strings 'fido' and
'leader' several times is bad. Well, let's for argument's sake assume that
I would agree (which I don't) and that I would change the things table and
add a hierarchy_types table:

create table things (thingID int not null,
                     thingName varchar(20) not null,
                     primary key (thingID))

create table hierarchy_types (hierarchyID int not null,
                              hierarchyName varchar(10) not null,
                              primary key (hierarchyID))

That would allow me to remove all 'fido's and 'leader's from the
hierarchies table. Instead, it would now look like this:

Table hierarchies
thingID   otherThingID hierarchyID
--------- ------------ -----------
(snip some rows)
17        3            1
17        4            1
17        5            1

And the table things would read:

thingID thingName
------- ---------
(...)
3       'john'
4       'mary'
5       'luke'
(...)
17      'fido'

These relations could be expressed as:
The thing referenced by the number 17 is just below the thing referenced
by the number 3 in the hierarchy referenced by the number 1.
(etc)
The thing referenced by the number 17 is referenced by humans by the name
'fido'.
(etc)

This does indeed reduce the number of times you'll see the string 'fido'
in the database to 1. But we still have multiple references to fido (note
the important distinction between fido (no quotes - a dog) and 'fido'
(quotes - a name)). We didn't remove anything from the database, we only
changed the way things are referenced and added an extra relation.
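As a sanity check, that ID-based variant can be run end to end. The sketch below uses Python's sqlite3 module purely for illustration; the SQLite dialect, the sample ID values and the join shape are assumptions based on the tables shown in this thread, not Hugo's actual SQL Server script:

```python
import sqlite3

# Illustrative surrogate-key schema assumed from the thread's tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE things (
        thingID   INTEGER PRIMARY KEY,
        thingName TEXT NOT NULL
    );
    CREATE TABLE hierarchy_types (
        hierarchyID   INTEGER PRIMARY KEY,
        hierarchyName TEXT NOT NULL
    );
    CREATE TABLE hierarchies (
        thingID      INTEGER NOT NULL REFERENCES things,
        otherThingID INTEGER NOT NULL REFERENCES things,
        hierarchyID  INTEGER NOT NULL REFERENCES hierarchy_types,
        PRIMARY KEY (thingID, otherThingID, hierarchyID)
    );
""")
con.executemany("INSERT INTO things VALUES (?, ?)",
                [(3, "john"), (4, "mary"), (5, "luke"), (17, "fido")])
con.execute("INSERT INTO hierarchy_types VALUES (1, 'leader')")
con.executemany("INSERT INTO hierarchies VALUES (?, ?, ?)",
                [(17, 3, 1), (17, 4, 1), (17, 5, 1)])

# Joins recover the readable rows from the ID-only hierarchies table.
rows = con.execute("""
    SELECT t.thingName, o.thingName, h.hierarchyName
    FROM hierarchies hi
    JOIN things t          ON t.thingID = hi.thingID
    JOIN things o          ON o.thingID = hi.otherThingID
    JOIN hierarchy_types h ON h.hierarchyID = hi.hierarchyID
    ORDER BY o.thingID
""").fetchall()
print(rows)
```

The string 'fido' now appears in exactly one row of things, while the three leader relations survive unchanged - the point being that the references moved, not the relations.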

You claim that my model has redundancy. But if I follow your suggestion of
"improving", I end up with the same data in the existing tables and some
extra relations that I could do without.

I don't know what definition of "redundant" you've been taught. I think
that "Exceeding what is necessary or natural; superfluous." (*) does quite
accurately describe the schema with both thingID and thingName, not my
original schema!

(*) Source: The American Heritage® Dictionary of the English Language,
Fourth Edition
Copyright © 2000 by Houghton Mifflin Company.
Published by Houghton Mifflin Company. All rights reserved.


>Also consider what will happen if the first person's name is 'john'
>and the dog's name is also 'john'.

In my relational implementation, this will be rejected because a primary key
constraint is violated.
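That rejection is easy to reproduce. A minimal sketch with an illustrative one-column table (sqlite3 stands in for SQL Server here; the table name is an assumed simplification of the schema in this thread):

```python
import sqlite3

# A second thing named 'john' violates the primary key and is rejected.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE things (thingName TEXT PRIMARY KEY)")
con.execute("INSERT INTO things VALUES ('john')")
try:
    con.execute("INSERT INTO things VALUES ('john')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```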

XDb1 fails in a much more spectacular way, as I'll explain in another
message.

Hugo Kornelis

unread,
May 20, 2004, 6:51:50 PM5/20/04
to
On 20 May 2004 11:25:21 -0700, Neo wrote:

>> > Among other things, the difference in normalization between the
>> > implementations (still looking it over) is quite different.
>>
>> I don't demand that XDb1 should use a relational structure,
>> you should not demand that a RM solution uses XDb1's structure.
>
>And I don't demand that a RM solution use XDb1's structure, only that
>it provide a solution that normalizes the data to a similar extent.

Already addressed in another message.

Really, Neo, is it really necessary to repeat the same argument over and
over again? And is it really necessary to use three separate messages to
respond to one message of mine? It's getting quite hard to keep track of
everything in here if you keep splitting the thread like this.


>For example, if the user names the first person 'john' and the dog
>also 'john', your current solution won't work.

Indeed, it won't. And neither will XDb1.


>In XDb1, the solution will still continue to work,

Should? Maybe, I don't know - I didn't design the thing. But it DOESN'T
continue to work, and that's the only thing that counts at the end of the
day.


> although it should
>also print each thing's class (ie "person/john" and "dog/john") so
>that user can distinguish them on the report.

You obviously did not test this before you wrote this. Or you are a flat
out liar (but for now, I'll give you the benefit of the doubt - by the
way, did you already instruct your bank to transfer the $1000 you owe me?)

I downloaded XDb1, created a new database and fed the data from Ex076 to
XDb1, changing only 'fido' to 'john' but nothing else (by the way, is
there no way to execute an entire script at once in XDb1? I ended up
copying and pasting the script line by line, but there's got to be a more
efficient way, right?) Here's what happened:

Up until "dog isa thing" all went fine (of course - I had not executed one
of the changed lines yet). XDb1 accepted "john isa dog" just fine (unlike
what my SQL Server implementation would have done), but in spite of what
you say that "should" happen, XDb1 did not distinguish john from john. There
still was only one john, classified as both a person and a dog and aged
35. The next lines went fine, until I came to "fido leader john", which I
of course changed to "john leader john". This resulted in an error
message: "Circular relation not allow." I skipped this one, continued to
enter the rest of the expressions, then generated the nearest common
ancestor report. This was the output:

Common Ancestor Report for 'god'
ThingX    ThingY    CmnAnc    Dist
army      john      army      1
army      laptop1   army      2
army      mary      army      1
army      trinity   god       2
army      luke      god       3
john      laptop1   john      1
john      mary      mary      1
john      trinity   trinity   2
john      luke      luke      1
laptop1   mary      mary      1
laptop1   trinity   trinity   2
laptop1   luke      luke      2
mary      trinity   trinity   1
mary      luke      trinity   2
trinity   luke      trinity   1

If "continue to work" means that you still get *some* output, yes, then
XDb1 still continues to work. But it's not even close to the output of the
original input (when the dog was called 'fido'). And the report certainly
doesn't print "person/john" or "dog/john".


> In XDb1, if the user
>names the first person 'john', the second person 'mary' and the dog
>'mary john', there would still be only one occurance of the word
>'john' and one of the word 'mary' in the db ('mary john' would be
>composed from references to 'mary' and 'john').

As I pointed out before - how the data is stored internally is completely
irrelevant. The relational model is not about how data is represented
inside a computer, it is about how relations between things are modelled.
You set a challenge to generate a report in the relational model, so let's
stick to that model and not discuss XDb1's internals anymore.

Hugo Kornelis

unread,
May 20, 2004, 7:13:58 PM5/20/04
to
On 20 May 2004 12:04:33 -0700, Neo wrote:

>> > Among other things, the difference in normalization between the
>> > implementations is quite different.
>>
>> It should not matter HOW my implementation generates the report.
>
>It does matter because a less generic and unnormalized implementation
>will cause more problems or even fail under a broader scope. For
>instance when the names of two things are the same.

I have to agree that like-named things cause a difference. My RDBMS
implementation will refuse such input instead of making a hideous mess of
things like XDb1 does. But let's not repeat the same things over and over
again - check my previous message for the details.


>Another isssue with the provided implementation is that the class
>hierarchy is not modelled faithfully. A part of the original hierarchy
>is shown below:
>
>thing
>  god
>  person
>    john
>    mary
>  dog
>    fido ...
>
>In the provided implementation, the class hierarchy is (flat):
>
>thing
>  god
>  person
>  john
>  mary
>  dog
>  fido ...

You obviously didn't check my model too well. If you really want to use a
tree-like notation to show how my model stores things (which can only be
achieved by stretching things very far, since I didn't use a tree-like
model to begin with), you should arrive at this:

[thing]    -- Not stored. Since everything is a thing by definition,
           -- there's no need to explicitly store that information.
           -- Some would even call it redundant.
  god      -- Leaf level, therefore it's a thing.
  person   -- Not leaf level, therefore it's a class.
    john   -- Leaf level, therefore it's a thing.
    mary   -- Leaf level, therefore it's a thing.
  dog      -- Not leaf level, therefore it's a class.
    fido   -- Leaf level, therefore it's a thing.

As you can see, apart from removing the redundant storing that everything
is a thing, there is no difference.

Not that it matters anyway. Models are not a goal, but a tool. They are
used to model things from the real world. The only valid comparison
between two models is in comparing what they can or can't store.

Your example of two things with the same name is a very valid way to test
these models against each other. Unfortunately (for you), the XDb1 model
failed that test in a much more spectacular way than my model.


>Thus, if a person is named 'skippy' and a dog is also named 'skippy',
>and we print a thing's class to distinguish them on the report, XDb1
>would print:
> person/skippy
> dog/skippy

No, it wouldn't. I tested it (with the name 'john' instead of 'skippy',
but I don't suppose that will matter) and XDb1 messed up big time. See my
other message for details.


>whereas the provided implementation would print:
> thing/skippy
> thing/skippy

No it wouldn't. It returns an error:

Server: Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'PK__things__64CDBE05'. Cannot insert
duplicate key in object 'things'.
The statement has been terminated.
Server: Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'PK__types__66B60677'. Cannot insert
duplicate key in object 'types'.
The statement has been terminated.

You see, unlike you I *do* test my code before I make claims here.

Neo

unread,
May 21, 2004, 1:06:44 AM5/21/04
to
> > For example, if the user names the first person 'john' and the dog
> > also 'john', your current solution won't work.
>
> Indeed, it won't. And neither will XDb1. You obviously did not test
> this before you wrote this. Or you are a flat out liar ...

It does work. The steps you took to verify it were inappropriate. The
easiest way is to use the populated db and rename fido to john. To do
this, navigate to node thing/dog/fido. Click fido twice with a small
delay between clicks to enter edit mode. Change fido to john and press
enter. The report is as before except occurrences of 'fido' are
replaced with 'john'. In this case, printing each thing's class (ie
'person/john', 'dog/john') would allow user to distinguish them.

The reason your steps did not work is because by modifying the script,
you created a person named john who you then also classified as a dog.
You never created a dog to rename. When you enter 'john isa person',
XDb1 looks for something named john in the db. Since it does not find
one, it creates a new instance of person and names it john. The same
thing occurs with 'fido isa dog'. But you modified it to 'john isa
dog'. Since XDb1 found a thing already named john, it thought you
meant the existing person. Similar to human conversations,
communication via an English-like language can result in
misinterpretation. Using the GUI or API reduces such ambiguities.

> I ended up copying and pasting the script line by line,
> but there's got to be a more efficient way, right?

Sorry, but the ability to process multiple lines is not enabled in the
release version currently.



> As I pointed out before
> - how the data is stored internally is completely irrelevant.

True, but the solution's genericness and level of data normalization
are relevant because they can limit its scope. For instance, the
provided solution fails when two things have the same name.

thirdrock

unread,
May 21, 2004, 3:32:02 AM5/21/04
to

I was cheering you on until you said this ...

>
> You might be surprised to hear this, but there's a reason that the
> relational model is called "relational". It's about storing relations.

No, unless I misunderstood what you meant by that. A relation refers to
the relationship between data members of the same 'table'. A 'table' was
originally called a relation. Joins between relations are not the reason
why the relational model is called relational.

Or is that what you said??

> There would have been redundancy if each object's class was stored along
> with it's reference, like this:
>
> thing     class     otherThing otherClass hierarchy
> --------- --------- ---------- ---------- ---------
> (snip some rows)
> 'fido'    'dog'     'john'     'person'   'leader'
> 'fido'    'dog'     'mary'     'person'   'leader'
> 'fido'    'dog'     'luke'     'person'   'leader'

The above table does not meet the 1NF rule, correct?

>
> This example of a bad design has redundant storage of the same relation.

Er.. actually it's an example of non-normalised data. I don't think design
has anything to do with it.

>
> You seem to be thinking that storing the character strings 'fido' and
> 'leader' several times is bad.

Bad, or just not normalised?

>
> This does indeed reduce the number of times you'll see the string 'fido'
> in the database to 1. But we still have multiple references to fido (note
> the important distinction between fido (no quotes - a dog) and 'fido'
> (quotes - a name). We didn't remove anything from the database, we only
> changed the way things are referenced and added an extra relation.

True, but if fido goes down to the Names Registry and changes his name to
'FiFi' in preparation for a sex change operation, then that change is not
cascaded out to all the records of things. Whereas in your second table,
it does. So what exactly is your point? Did you say that your first table
is normalised?


>
> You claim that my model has redundancy. But if I follow your suggestion
> of
> "improving", I end up with the same data in the existing tables and some
> extra relations that I could do without.

No, you end up with normalised data, which I think is Neo's point (is
it, Neo?)

>
> I don't know what definition of "redundant" you've been taught. I think
> that "Exceeding what is necessary or natural; superfluous." (*) does
> quite
> accurately describe the schema with both thingID and thingName, not my
> original schema!

Well, writing a trigger to cascade out fido's name change to 'fifi' is
redundant if your model was such that it was unnecessary.

>
> XDb1 fails in a much more spectacular way, as I'll explain in another
> message.
>

I'll look for that message too.

Have fun.

Ian


--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

thirdrock

unread,
May 21, 2004, 4:05:20 AM5/21/04
to
On Tue, 18 May 2004 11:27:12 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:


>
> Your explanation does raise another question. It looks as if the same
> syntax is used to specify both intension and extension of the model,
> thereby eliminating the distinction between schema and population. My
> guess is that XDb1 would accept the following without complaining:
>
> person isa thing.
> john isa person.
> mary isa person.
> neo isa john.
>
> XDb1 will probably gladly store it - but what does it mean?

Interesting question. It probably means that anything can be anything.
Which might have an application in text/html processing. You could easily
create XML documents and maybe even schemas by extending XDb1. However,
looking at the onerous licensing page on the site, I'm not sure if you
would want to do that.

> There might be situations where an untyped database can have
> its use (I admit that my original choice of words was too harsh), but
> outside of those niches, the schema should be strictly seperate from the
> data and the datatype of the data should be known.

While this is true for structural/relational models, in the OO model you
can enforce the datatype or even data behaviour by using member functions.
By overloading functions or operators in C++ you can compare 7 to eight,
over-the-hill to very-young and anything else you desire. Such
functionality may soon make its way into SQL Server as well, so perhaps
you don't require the data to be strongly typed as long as the operators
on the data members can be overloaded.

Just how this can be done with XDb1, I don't know, as I am concerned about
downloading it due to factors mentioned earlier.

>
> And this is exactly the reason why I'd never use XDb1 for serious work,
> unless I encounter a problem area where the advantages of allowing
> untyped
> data outweigh the disadvantages.

It is said that if all you have is a hammer, everything looks like a nail.
There are a range of applications that require more flexibility than
strongly typed languages provide. Often the language of choice used is
Perl, and those who use it take the work seriously. But is a language a
database, or a database a language? After all, SQL Server stores the data
on disk as binary, whether or not it is integer or string, and it is the
database engine that reads/writes from/to the disk that enforces the
datatype. How is this different from using member functions/overloaded
operators in an object database?

>
> In 99.9% of all applications that store a person's age, comparisons have
> to be made: who is younger than 45 years? Who is older, John, Mary or
>> Fido? How can XDb1 (or any other type-less database) answer the last
>> question if the user has provided the following input:
>
> over-the-hill isa age.
> very-young isa age.
> 7 isa age.
> john is over-the-hill.
> mary is very-young.
> fido is 7.
>
> Most computers I have used will classify very-young as greater than
> over-the-hill and will refuse to compare either of these to 7.

Er, it's the software (or operating system) that does the
classification/comparison, not the computer. You did mean software right?
And that is not an indication of how computers work, only a limitation of
the current software.
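The limitation Hugo describes can be sketched in a few lines of Python, using his own example values (the dict is an illustrative stand-in for an untyped store, not anything XDb1 actually does):

```python
# Untyped "ages": two strings and one integer.
ages = {"john": "over-the-hill", "mary": "very-young", "fido": 7}

# Lexicographic comparison gives a nonsense answer: 'v' sorts after 'o',
# so very-young compares as "greater than" over-the-hill.
nonsense = ages["mary"] > ages["john"]
print(nonsense)  # True

# Comparing a string age to a numeric age simply fails.
try:
    ages["john"] > ages["fido"]
    comparable = True
except TypeError:
    comparable = False
print(comparable)  # False
```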

>
> Like I said - there may be specific situations where a product such as
> XDb1 has it's use. But it's not (to quote the web site) "the future of
> databases" - not even remotely!

Maybe not XDb1. After all, the customer service skills shown by Neo so far
preclude that possibility. But Object databases I would not be so quick
to dismiss. XDb1 is the beginnings of an Object database, only it's a
technology that I actually saw more than 15 years ago, and which has
evolved into a product called Cache (www.intersystems.com)

> is not defined for the character domain and would result in an error. I
> know all this, and I can rely on it - if and only if the database ensures
> that only numeric data will be accepted in the Age column and only
> character data in the Color column.

See my comment above about operator overloading. Also, your Colour could
be an RGB value. Where are you then? In weakly typed languages/databases
you can accept both black and 0, and white and 16777215, as Colours.

Anyway, have fun :)

Ian

Neo

unread,
May 21, 2004, 3:24:55 PM5/21/04
to
> Starting time Ending time Time Elapsed
> ------------- ------------ ------------
> 14:44:57.670 14:44:57.733 67 ms
> 15:02:26.233 15:02:26.297 64 ms
> 15:07:57.780 15:07:57.843 63 ms

Yesterday's results were all 1st runs after running the provided
script in a new db each time (on NT 4, SQL Server 7, with sp6). Just
to double check, I ran the script again today.

Execute script in existing system tempdb:
Start End Elapsed Comment
------ ------ ------- ----------------
16.860 17.013 153 1st run
03.013 03.047 34 2nd conseq run
31.717 31.750 33 3rd conseq run
48.920 48.950 30 4th conseq run

Drop relevant tables.
Execute script in existing system tempdb:
Start End Elapsed Comment
------ ------ ------- ----------------
06.623 06.780 157 1st run
29.297 30.483 1,186 2nd conseq run (yes 1.186 seconds)
05.577 05.610 33 3rd conseq run

Execute script in brand new Test db:
Start End Elapsed Comment
------ ------ ------- ----------------
06.687 06.733 46 1st run
47.640 47.700 60 2nd conseq run
06.343 06.403 60 3rd conseq run
31.000 31.060 60 4th conseq run

Drop Test db.
Executed script in new Test db:
Start End Elapsed Comment
------ ------ ------- ----------------
32.403 32.450 47 1st run
34.577 34.640 63 2nd conseq run
56.153 56.217 64 3rd conseq run
10.623 10.797 174 4th conseq run

What might I be doing wrong to get results different than yours?

Below are 5 consecutive runs with XDb1 printing IDs only, thus
approximating your solution which does not link to other tables in
order to generate the report. Yes, this is not a fair comparison, as
character keys are slower than numeric keys.
It is only meant to be a very rough comparison.
2.23 ms
2.16 ms
2.04 ms
2.18 ms
2.15 ms

Neo

unread,
May 21, 2004, 4:21:24 PM5/21/04
to
> > Among other things, the difference in normalization between the
> > implementations is quite different.
>
> It should not matter HOW my implementation generates the report.

It does matter because a less generic and unnormalized implementation
will cause more problems or even fail under a broader scope.

Here is another example. Suppose the user enters 'john' in tbl_thing
and then enters 'john' in other tables. Later he tries to change
'john' to 'johnny' in tbl_thing. I tried, but I can't because of
constraints to keep redundant data in multiple tables from becoming
unsynchronized (UPDATE statement conflicted with COL REF constraint
'FK_types_thing_0D'. This conflict occurred in db 'Test', table
'types', col 'thing'. Statement terminated.)

In XDb1, user can change 'john' to 'johnny' and the report runs as
before.

Using data itself to link redundant data is a non-generic solution.
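For contrast, the rename scenario under a surrogate-key design can be sketched as follows. This is an illustrative sqlite3 sketch of an assumed simplified schema, not the actual tbl_thing DDL from the thread; with an ID as the key, renaming touches exactly one row and no foreign key objects:

```python
import sqlite3

# Assumed simplification: things keyed by ID, types referencing the ID.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE things (
        thingID   INTEGER PRIMARY KEY,
        thingName TEXT NOT NULL
    );
    CREATE TABLE types (
        thingID  INTEGER NOT NULL REFERENCES things,
        typeName TEXT NOT NULL
    );
""")
con.execute("INSERT INTO things VALUES (3, 'john')")
con.execute("INSERT INTO types VALUES (3, 'person')")

# The rename is a single UPDATE; the reference (ID 3) never changes,
# so the foreign key constraint has nothing to object to.
con.execute("UPDATE things SET thingName = 'johnny' WHERE thingID = 3")
name = con.execute("""
    SELECT thingName FROM things JOIN types USING (thingID)
    WHERE typeName = 'person'
""").fetchone()[0]
print(name)
```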

Hugo Kornelis

unread,
May 21, 2004, 5:27:31 PM5/21/04
to
On Fri, 21 May 2004 18:05:20 +1000, thirdrock <iktaccounts at optusnet dot
com dot au> wrote:

(snip stuff I agree with)

>> And this is exactly the reason why I'd never use XDb1 for serious work,
>> unless I encounter a problem area where the advantages of allowing
>> untyped
>> data outweigh the disadvantages.
>
>It is said that if all you have is a hammer, everything looks like a nail.
>There are a range of applications that require more flexibility than
>strongly typed languages provide. Often the language of choice used is
>Perl, and those who use it take the work seriously. But is a language a
>database, or a database a language? After all, SQL Server stores the data
>on disk as binary, whether or not it is integer or string, and it is the
>database engine that reads/writes from/to the disk that enforces the
>datatype. How is this different from using member functions/overloaded
>operators in an object database?

I'm afraid I don't see what point you're trying to make.


>> In 99.9% of all applications that store a person's age, comparisons have
>> to be made: who is younger than 45 years? Who is older, John, Mary or
>> Fido? How can XDb1 (or any other type-less database) answer the last
>> question if the user has provided the following input:
>>
>> over-the-hill isa age.
>> very-young isa age.
>> 7 isa age.
>> john is over-the-hill.
>> mary is very-young.
>> fido is 7.
>>
>> Most computers I have used will classify very-young as greater than
>> over-the-hill and will refuse to compare either of these to 7.
>
>Er, it's the software (or operating system) that does the
>classification/comparison, not the computer. You did mean software right?
>And that is not an indication of how computers work, only a limitation of
>the current software.

Yes, I did mean the software. Sorry for the sloppy choice of words.

(snip some more stuff I agree with)


>> is not defined for the character domain and would result in an error. I
>> know all this, and I can rely on it - if and only if the database ensures
>> that only numeric data will be accepted in the Age column and only
>> character data in the Color column.
>
>See my comment above about operator overloading. Also, your Colour could
>be an RGB value. Where are you then? In weakly typed languages/databases
>you can accept black and 0 and white and 16777216 as Colours.

In a relational database, I can define Colour with a datatype that accepts
RGB values instead of english names. If overloading of operators is added
as a feature in the future, you could create a datatype colour that
accepts both, that will overload the = operator to return true if 'red' is
compared to 0xFF0000 and that will gladly accept the expression 'green' +
'orange' and return 'yellow' (green = 0x00FF00; orange = 0xFF5000; the
result of adding (mixing) these colors is the bitwise or of these RGB
values: 0xFFFF00, or yellow).
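The overloading described above can be sketched in a few lines of Python. The `Colour` class and its name table are hypothetical illustrations of the idea, not part of any RDBMS:

```python
# Hypothetical colour domain: accepts English names and 3-byte RGB values.
NAMES = {"red": 0xFF0000, "green": 0x00FF00, "orange": 0xFF5000, "yellow": 0xFFFF00}

class Colour:
    def __init__(self, value):
        # Accept an English name or a 3-byte RGB integer.
        self.rgb = NAMES[value] if isinstance(value, str) else int(value)
        if not 0 <= self.rgb <= 0xFFFFFF:
            raise ValueError("not a 3-byte RGB value")

    def _coerce(self, other):
        return other if isinstance(other, Colour) else Colour(other)

    def __eq__(self, other):
        # Overloaded '=': 'red' compares equal to 0xFF0000.
        return self.rgb == self._coerce(other).rgb

    def __add__(self, other):
        # Overloaded '+': mixing is the bitwise OR of the RGB values,
        # so green (0x00FF00) + orange (0xFF5000) gives yellow (0xFFFF00).
        return Colour(self.rgb | self._coerce(other).rgb)

print(Colour("red") == 0xFF0000)               # True
print(Colour("green") + "orange" == "yellow")  # True
```

As the following paragraph notes, all of this behaviour still has to be written by the designer; the user gets only what was pre-defined.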

But I (the developer) have to do this. I'll have to write the code for the
overloaded = and + operators. If I do that, I can extend the range of
acceptable values for the "colour" domain to include all English color
names, all 3-byte values from 0x000000 to 0xFFFFFF and all integers that
are a decimal representation of one of those 3-byte values. After that,
the end user will still be unable to enter 'banana' in the color field of
his application screen and expect the software to know that (s)he means a
slightly dark yellow.

Overloading gives the designer the possibility to create lots of extra
possibilities for the user. But the user still can't use things that are
not pre-defined by the designer. That's the point I was trying to make:
XDb1's parser will gladly accept ages of 7, "over the top", "slightly
older than I am" or even "I think therefore I am" without bothering to
check if it has any overloaded operators that are able to do anything
useful with that data.

Hugo Kornelis
May 21, 2004, 5:50:42 PM
On Fri, 21 May 2004 17:32:02 +1000, thirdrock <iktaccounts at optusnet dot
com dot au> wrote:

Hi Ian,

>I was cheering you on until you said this ...
>
>>
>> You might be surprised to hear this, but there's a reason that the
>> relational model is called "relational". It's about storing relations.
>
>No, unless I misunderstood what you meant by that. A relation refers to
>the relationship between data members of the same 'table'. A 'table' was
>originally called a relation. Joins between relations is not the reason
>why the relational model is called relational.
>
>Or is that what you said??

Did I write that? (goes back to check) Oh dear, I did.

I meant to write: "It's about storing relationships", but somehow I forgot
to type the hips. Apologies for causing confusion.


>> There would have been redundancy if each object's class was stored along
>> with its reference, like this:
>>
>> thing class otherThing otherClass hierarchy
>> --------- --------- ---------- ---------- ---------
>> (snip some rows)
>> 'fido' 'dog' 'john' 'person' 'leader'
>> 'fido' 'dog' 'mary' 'person' 'leader'
>> 'fido' 'dog' 'luke' 'person' 'leader'
>
>The above table does not meet the 1N rule, correct?

When you write "the 1N rule", do you mean "1NF" (first normal form), or
are you referring to something else?

AFAIS, this table does not violate 1NF (no repeating groups). It does
violate 2NF (class and otherClass are not fully dependent on the primary
key).
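The 2NF decomposition implied here can be sketched with Python's sqlite3 module standing in for SQL Server. The table and column names follow the thread's examples, but the schema itself is illustrative, not Hugo's actual script:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Decomposed schema: each thing's class is stored exactly once in things;
# hierarchies records only the relationship itself, so no attribute
# depends on just part of the key.
conn.execute("CREATE TABLE things (thing TEXT PRIMARY KEY, class TEXT)")
conn.execute("""CREATE TABLE hierarchies (
    thing      TEXT REFERENCES things(thing),
    otherThing TEXT REFERENCES things(thing),
    hierarchy  TEXT,
    PRIMARY KEY (thing, otherThing, hierarchy))""")
conn.executemany("INSERT INTO things VALUES (?, ?)",
                 [("fido", "dog"), ("john", "person"),
                  ("mary", "person"), ("luke", "person")])
conn.executemany("INSERT INTO hierarchies VALUES (?, ?, 'leader')",
                 [("fido", "john"), ("fido", "mary"), ("fido", "luke")])

# The wide, redundant rows from the post are recovered by joining,
# not by storing each class with every reference.
rows = conn.execute("""
    SELECT h.thing, t1.class, h.otherThing, t2.class, h.hierarchy
    FROM hierarchies h
    JOIN things t1 ON t1.thing = h.thing
    JOIN things t2 ON t2.thing = h.otherThing""").fetchall()
print(len(rows))  # 3
```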


>> This example of a bad design has redundant storage of the same relation.
>
>Er.. actually it's an example of non-normalised data. I don't think design
>has anything to do with it.

Huh? Normalizing *IS* one of the steps you take when you design the schema
of a relational database. And the aim of normalizing is to remove
redundant relationships (not relations) from the design.


>> You seem to be thinking that storing the character strings 'fido' and
>> 'leader' several times is bad.
>
>Bad, or just not normalised?

If I understand Neo correctly, he thinks it is not fully normalised, AND
he thinks that anything that is not fully normalised is bad.


>> This does indeed reduce the number of times you'll see the string 'fido'
>> in the database to 1. But we still have multiple references to fido (note
>> the important distinction between fido (no quotes - a dog) and 'fido'
>> (quotes - a name). We didn't remove anything from the database, we only
>> changed the way things are referenced and added an extra relation.
>
>True, but if fido goes down to the Names Registry and changes his name to
>'FiFi' in preparation for a sex change operation, then that change is not
>cascaded out to all the records of things. Whereas in your second table,
>it does. So what exactly is your point? Did you say that your first table
>is normalised?

Hmmm, you're right. I did indeed forget to add "on update cascade" and "on
delete cascade" to the "references" clauses in my original message (the
first one I posted in this discussion, four days ago).


>> You claim that my model has redundancy. But if I follow your suggestion
>> of
>> "improving", I end up with the same data in the existing tables and some
>> extra relations that I could do without.
>
>No, you end up with normalised data, which I think is Neo's point (is it
>Neo?)

You end up with data that is no more normalised than you started with, but
that does need extra columns in some tables (or: extra relationships in
the schema).


>> I don't know what definition of "redundant" you've been taught. I think
>> that "Exceeding what is necessary or natural; superfluous." (*) does
>> quite
>> accurately describe the schema with both thingID and thingName, not my
>> original schema!
>
>Well, writing a trigger to cascade out fido's name change to 'fifi' is
>redundant if your model was such that it was un-neccessary.

Why write a trigger if you can use DRI?

>> XDb1 fails in a much more spectacular way, as I'll explain in another
>> message.
>>
>I'll look for that message too.
>
>Have fun.
>
>Ian

Hugo Kornelis
May 21, 2004, 6:03:09 PM
On 20 May 2004 15:07:05 -0700, Neo wrote:

Hi Neo,

>> > Changing any one of them would corrupt the data.
>>
>> No. Changing any of them would result in a foreign key violation and the
>> transaction would be rolled back by SQL Server. You didn't miss the
>> "references" clauses in my DDL, did you?
>
>You are correct. I could not change the name of any thing in the first
>two columns of table hierarchies so I could not corrupt the data this
>way; however, this design prevents two things from having the same
>name, thus it is a non-generic solution.

Already addressed elsewhere.


>I did manage to corrupt the data by changing one of the duplicate
>'leader' to 'leader2' in the third column. You missed it here. In
>XDb1, the relator 'leader' is normalized.

Hold on a minute, Neo. Please check what you posted in the message that
started this discussion:

"(...) and the solution must be as generic, meaning allow the user to
create ANY hierarchy, (...)"

My first attempt at your challenge didn't have a table hierarchies, it had
a table leadership (or something like that), with two columns leader and
follower. But when I reread your post to check if I fulfilled all
requirements, I saw that you wanted a GENERIC solution, that would allow
ANY hierarchy. That's why I renamed the table "hierarchies", added the
hierarchy column and added a parameter to the procedure.

Now, I just tested what would happen if I tried to do the same kind of
thing with the XDb1 solution. I entered "laptop1 leader2 trinity." and got
the following error message: "Invalid relator."

Hmmmmm. I thought the requirement was to "allow the user to create ANY
hierarchy" ????
http://dictionary.reference.com/search?q=any

Hugo Kornelis
May 21, 2004, 6:36:35 PM
On 20 May 2004 22:06:44 -0700, Neo wrote:

Hi Neo,

>> > For example, if the user names the first person 'john' and the dog
>> > also 'john', your current solution won't work.
>>
>> Indeed, it won't. And neither will XDb1. You obviously did not test
>> this before you wrote this. Or you are a flat out liar ...
>
>It does work. The steps you took to verify it were inappropriate. The
>easiest way is to use the populated db and rename fido to john.

This is really starting to get funny. You wrote:
"if the user names the first person 'john' and the dog also 'john'"

Now please tell me why a user who wants to name a dog 'john' would first
enter its name as 'fido' and then change the name later?


> To do
>this, navigate to node thing/dog/fido. Click fido twice with a small
>delay between clicks to enter edit mode. Change fido to john and press
>enter. The report is as before except occurrences of 'fido' are
>replaced with 'john'. In this case, printing each thing's class (ie
>'person/john', 'dog/john') would allow user to distinguish them.

(giggle) not exactly, Neo. I got the following when I tried it:

Common Ancestor Report for 'god'
ThingX    ThingY    CmnAnc    Dist
army      john      army      1
army      laptop1   army      2
army      john      army      2
army      mary      army      1
army      trinity   god       2
army      luke      god       3
john      laptop1   john      1
john      john      john      1
john      mary      army      2
john      trinity   god       3
john      luke      god       4
laptop1   john      john      2
laptop1   mary      mary      1
laptop1   trinity   trinity   2
laptop1   luke      trinity   3
john      mary      mary      1
john      trinity   trinity   2
john      luke      luke      1
mary      trinity   trinity   1
mary      luke      trinity   2
trinity   luke      trinity   1

Time elapsed: 16 msec

I especially like the line "john john john 1". Now THAT is the kind of
data representation that a manager likes to base his decisions on, isn't
it?


>The reason your steps did not work is because by modifying the script,
>you created a person named john who you then also classified as a dog.
>You never created a dog to rename.

Of course I didn't. I didn't want to rename a dog, I wanted to enter a dog
named 'john'. Or rather - you suggested that I try it:

(repeated quote)


>> > For example, if the user names the first person 'john' and the dog
>> > also 'john', (...)

(snip irrelevant stuff about XDb1 internals)

>> I ended up copying and pasting the script line by line,
>> but there's got to be a more efficient way, right?
>
>Sorry, but the ability to process multiple lines is not enabled in the
>release version currently.

Yet another reason to steer clear of this product for any serious use.

>> As I pointed out before
>> - how the data is stored internally is completely irrelevant.
>
>True, but the solution's genericness and level of data normalization
>is relevant because it can limit its scope. For instance the provided
>solution fails when two things have the same name.

1. So does the XDb1 solution.
2. It is completely irrelevant. Please check the subject of this
discussion. You set a challenge. The rules for this challenge were
outlined in this message:

http://groups.google.nl/groups?q=g:thl1201841589d&dq=&hl=nl&lr=&ie=UTF-8&selm=4b45d3ad.0405161027.742aa17d%40posting.google.com

Please indicate which part of that message (or of the one web page that
this message provides a link to) explains the requirement that duplicate
names should not be allowed on entry, but should be allowed by renaming
existing objects at a later time.

I presume you can't. Therefore, I repeat my claim of the $1000 prize.
Please transfer the money to my bank account: International Bank Account
Number (IBAN) NL59 RABO 0118 3365 68 (bank information: BIC/SWIFT address
RABONL2U).

Hugo Kornelis
May 21, 2004, 7:38:01 PM
On 20 May 2004 14:22:24 -0700, Neo wrote:

Hi Neo,

>> > > This indicates an elapsed time of less than 3 ms ...
>> >
>> > Among other things, the difference in normalization between the
>> > implementations is quite different.
>> > In an initial attempt to make them more comparable...
>>
>> Now you're changing the rules after the competition has started.
>> Not a sign of good sportsmanship, Neo!
>
>The provided implementation was not similarly generic

Yes, I have to admit that my implementation was not "similarly" generic,
but even MORE generic than the XDb1 implementation, as I didn't limit the
possible hierarchies to a few predefined hierarchy types (as XDb1 does),
but allowed ANY hierarchy (as required in the challenge).

> or normalized.

This argument is getting stale. In another message, I quoted Date's
definition of normal forms, as well as a description of Domain Key Normal
Form (DKNF), said to be <<(theoretically, at least) the "ultimate normal
form">>. Instead of pointing out which of these normal forms you think my
implementation violates, your message only repeats uninteresting stuff
about XDb1's internals, followed by an unbacked claim that there is
redundancy in my table and that it's not normalized.

Backup your claim, if you can. But please leave XDb1 out of it - the
challenge was to use the relational model, not XDb1's model.


>To make a very rough comparision, I simplified my implementation to
>that which was provided.

Are you referring to this quote from another message you wrote: "I
modified XDb1 algorithm so as to not resolve things' names and simply
prints their IDs."

If so, I don't accept this simplification. The challenge was to create a
report (that contains names) from some input data (that contains names).
There were no IDs in either the input or the report. Whether there are IDs
under the cover is irrelevant.

If you want to start a new challenge to create a report showing only IDs
from input data containing names, by all means go ahead. Heck, I might
even try and modify my winning entry for this competition to see if I can
rid you of another $1000.


> In general, XDb1 can't compete with RM in the
>scope of a single table's worth of data since XDb1 has the overhead of
>variable storage structures for each table cell. And yes, IDs are
>faster than text so the comparison is far from perfect. XDb1 should
>gain advantage as more tables are joined. The provided solution
>doesn't join any tables since it is less generic and unnormalized.

The provided solution is not less, but even more generic. And it is
completely normalized.

The reason that it doesn't join any tables is that I made proper use of my
RDBMS' tools, instead of the brainless automatic insertion of an
autonumber/identity/guid table_ID column in every table that so many of my
fellow RDBMS developers fall victim to.


>As I said earlier, in the simplified case, it takes XDb1 2 or 3 ms on a
>500 Mhz, 512MB Dell PowerEdge Server.

As I said earlier, this is irrelevant to the challenge. In the original
form, XDb1 took 16 ms on your machine. Interestingly, it also took 16 ms
on my machine (AMD Athlon 1.3 GHz, 256MB).

> I ran the provided SQL Server
>script on the same machine and the report generation takes 65 ms
>(based on the avg of the 3 runs below):
>
>Starting time Ending time Time Elapsed
>------------- ------------ ------------
>14:44:57.670 14:44:57.733 67 ms
>15:02:26.233 15:02:26.297 64 ms
>15:07:57.780 15:07:57.843 63 ms
>
>What might I be doing wrong to get results different than yours?

I assume that you copied and pasted my script straight into Query Analyzer
and executed it without any modification.

It's hard to diagnose the cause of this difference without access to your
computer. But I did think of one thing that might cause this difference.

When I installed SQL Server, I chose the collation "Latin1_General_BIN" as
default. That means that character data comparisons are done by straight
comparing of ASCII values. This is the quickest, but it has the
disadvantage that 'john', 'John' and 'JOHN' are treated as three distinct
values. If I wanted to, I could have a person named 'john' and a dog named
'John'.

I don't know what the default collation is, but I do know that it is one
of the case insensitive collations. One that treats 'john', 'John' and
'JOHN' as equal. This will of course introduce some overhead for string
comparisons - which are used quite extensively in my implementation.

Since this is the default collation, I assume that your database will use
case insensitive comparisons. There is a simple way to find out: change
one of the occurrences of 'fido' in the insert statements for hierarchies
to 'Fido'. I expect that your database will accept this, showing that it
uses a case insensitive collation. If it is not accepted, your database
uses a case sensitive collation; it might be a binary collation, but it
need not be - it might still be accent insensitive or use a double-byte
character set!
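The behavioural difference between a binary and a case-insensitive collation can be demonstrated with Python's sqlite3 module, whose BINARY and NOCASE collations serve here as rough stand-ins for SQL Server's binary and case-insensitive collations (the semantics are not identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_bin (name TEXT COLLATE BINARY)")
conn.execute("CREATE TABLE t_ci  (name TEXT COLLATE NOCASE)")
for t in ("t_bin", "t_ci"):
    conn.executemany(f"INSERT INTO {t} VALUES (?)",
                     [("john",), ("John",), ("JOHN",)])

# Binary collation: only the exact byte sequence matches.
bin_hits = conn.execute(
    "SELECT COUNT(*) FROM t_bin WHERE name = 'john'").fetchone()[0]
# Case-insensitive collation: all three spellings match.
ci_hits = conn.execute(
    "SELECT COUNT(*) FROM t_ci WHERE name = 'john'").fetchone()[0]
print(bin_hits, ci_hits)  # 1 3
```

Under the binary collation a person named 'john' and a dog named 'John' are distinct values, exactly as described above.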


Another possible explanation might be the other activity on the machine
that runs the SQL Server process. I have SQL Server on my desktop; when I
timed the queries I first shut down all other applications. (SQL Server is
a "friendly" process - it will grab extra memory or processor time if it
needs more, but only if that won't hinder other processes. Reducing other
processes will generally improve SQL Server's performance). Earlier
testing (with several other applications active) resulted in elapsed times
between 10 and 20 ms.

If you have SQL Server running on another server, make sure that server is
dedicated to SQL Server only, if you want the best performance. If you
have SQL Server running on your desktop, close all applications except
Query Analyzer when running this query.

Hugo Kornelis
May 21, 2004, 7:42:40 PM
On 21 May 2004 12:24:55 -0700, Neo wrote:

Hi Neo,

>> Starting time Ending time Time Elapsed
>> ------------- ------------ ------------
>> 14:44:57.670 14:44:57.733 67 ms
>> 15:02:26.233 15:02:26.297 64 ms
>> 15:07:57.780 15:07:57.843 63 ms
>
>Yesterday's results were all 1st runs after running the provided
>script in a new db each time (on NT 4, SQL Server 7, with sp6). Just
>to double check, I ran the script again today.
>
>Execute script in existing system tempdb:
>Start End Elapsed Comment
>------ ------ ------- ----------------
>16.860 17.013 153 1st run
>03.013 03.047 34 2nd conseq run
>31.717 31.750 33 3rd conseq run
>48.920 48.950 30 4th conseq run

Executing in tempdb is not a good idea. SQL Server uses tempdb to store
temporary tables and intermediate results. Use a separate test database.


>Drop relevant tables.
>Execute script in existing system tempdb:
>Start End Elapsed Comment
>------ ------ ------- ----------------
>06.623 06.780 157 1st run
>29.297 30.483 1,186 2nd conseq run (yes 1.186 seconds)
>05.577 05.610 33 3rd conseq run

The sudden boost in execution time for the second run raises the question:
were you the only one using SQL Server at that moment? Were other
processes active on the server? You might have executed simultaneously
with some other process (maybe even one of SQL Server's own housekeeping
processes).

>
>Execute script in brand new Test db:
>Start End Elapsed Comment
>------ ------ ------- ----------------
>06.687 06.733 46 1st run
>47.640 47.700 60 2nd conseq run
>06.343 06.403 60 3rd conseq run
>31.000 31.060 60 4th conseq run
>
>Drop Test db.
>Executed script in new Test db:
>Start End Elapsed Comment
>------ ------ ------- ----------------
>32.403 32.450 47 1st run
>34.577 34.640 63 2nd conseq run
>56.153 56.217 64 3rd conseq run
>10.623 10.797 174 4th conseq run
>
>What might I be doing wrong to get results different than yours?

Hard to diagnose from here, but see my other message as well.

>
>Below are 5 conseq runs with XDb1 printing IDs only, thus
>approximating your solution which does not link to other tables in
>order to generate the report. Yes, this is not a fair comparison as char
>keys are slower than numerical keys.
>It is only meant to be a very rough comparison.
> 2.23 ms
> 2.16 ms
> 2.04 ms
> 2.18 ms
> 2.15 ms

The execution time with XDb1 printing IDs only is not relevant to this
challenge. The challenge was to generate a report printing names.

Hugo Kornelis
May 21, 2004, 7:57:23 PM
On 21 May 2004 13:21:24 -0700, Neo wrote:

Hi Neo,

>> > Among other things, the difference in normalization between the
>> > implementations is quite different.
>>
>> It should not matter HOW my implementation generates the report.
>
>It does matter because a less generic and unnormalized implementation
>will cause more problems or even fail under a broader scope.
>
>Here is another example. Suppose the user enters 'john' in tbl_thing
>and then enters 'john' in other tables. Later he tries to change
>'john' to 'johnny' in tbl_thing. I tried, but I can't because of
>constraints to keep redundant data in multiple tables from becoming
>unsynchronized (UPDATE statement conflicted with COLUMN REFERENCE constraint
>'FK_types_thing_0D'. The conflict occurred in db 'Test', table
>'types', col 'thing'. Statement terminated.)

You're right. I had forgotten to include DRI actions when I posted the
script for my implementation. Everywhere I wrote "references things", I
should have written "references things on update cascade on delete
cascade". Due to a silly limitation of SQL Server, this can only be done
for one of the foreign keys in the hierarchies table; I'd have to write a
trigger to get the same result for the other one. (Neither the DRI actions
nor the trigger will affect the report writing performance in any way.)

I'm perfectly willing to supply the code for the trigger if necessary to
convince you, but not now - it's already 2 AM over here, so I'm off to
bed.
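The DRI actions Hugo describes can be sketched with Python's sqlite3 module, which (unlike the SQL Server limitation he mentions) happens to accept cascading actions on both foreign keys of one table. The schema below is a simplified stand-in for his script, not the script itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite3 needs this for FK enforcement
conn.execute("CREATE TABLE things (thing TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE hierarchies (
    thing      TEXT REFERENCES things(thing)
                    ON UPDATE CASCADE ON DELETE CASCADE,
    otherThing TEXT REFERENCES things(thing)
                    ON UPDATE CASCADE ON DELETE CASCADE,
    hierarchy  TEXT)""")
conn.execute("INSERT INTO things VALUES ('john'), ('fido')")
conn.execute("INSERT INTO hierarchies VALUES ('john', 'fido', 'leader')")

# With the cascade in place, renaming 'john' propagates to hierarchies
# instead of raising the foreign key violation Neo ran into.
conn.execute("UPDATE things SET thing = 'johnny' WHERE thing = 'john'")
row = conn.execute("SELECT thing FROM hierarchies").fetchone()
print(row[0])  # johnny
```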


>In XDb1, user can change 'john' to 'johnny' and the report runs as
>before.
>
>Using data itself to link redundant data is a non-generic solution.

Huh? Please elaborate.

thirdrock
May 21, 2004, 10:12:34 PM
On Fri, 21 May 2004 23:50:42 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:


>
> I meant to write: "It's about storing relationships", but somehow I
> forgot
> to type the hips. Apologies for causing confusion.

No problemo!


>
>
>>> There would have been redundancy if each object's class was stored
>>> along
>>> with its reference, like this:
>>>
>>> thing class otherThing otherClass hierarchy
>>> --------- --------- ---------- ---------- ---------
>>> (snip some rows)
>>> 'fido' 'dog' 'john' 'person' 'leader'
>>> 'fido' 'dog' 'mary' 'person' 'leader'
>>> 'fido' 'dog' 'luke' 'person' 'leader'
>>
>> The above table does not meet the 1N rule, correct?
>
> When you write "the 1N rule", do you mean "1NF" (first normal form), or
> are you refering to something else?

I've seen it written both ways in the texts, but I am aware of the
importance of having a common set of terms, otherwise we end up like the
queen of hearts in Through the Looking Glass.

>
> AFAIS, this table does not violate 1NF (no repeating groups). It does
> violate 2NF (class and otherClass are not fully dependent on the primary
> key).

Yes, you are correct.


>
>
>>> This example of a bad design has redundant storage of the same
>>> relation.
>>
>> Er.. actually it's an example of non-normalised data. I don't think
>> design
>> has anything to do with it.
>
> Huh? Normalizing *IS* one of the steps you take when you design the
> schema
> of a relational database. And the aim of normalizing is to remove
> redundant relationships (not relations) from the design.

True. However, sometimes your design will include data that is not
normalised, to avoid unnecessary overhead or to produce
certain types of behaviour (which I won't go into in this post).
Therefore, non-normalised data *can* be poor design, but it is not
necessarily so. It really depends on the context.
My point is (and I do have one) that the generalisation that
non-normalised data is bad design is not always true, and in fact is
probably the wrong way around. Good relational design will, as a general
rule, produce normalised data, but the inverse is probably not generally
true.


>
>
>>> You seem to be thinking that storing the character strings 'fido' and
>>> 'leader' several times is bad.
>>
>> Bad, or just not normalised?
>
> If I understand Neo correctly, he thinks it is not fully normalised, AND
> he thinks that anything that is not fully normalised is bad.

Which you don't, I assume, so we are in agreement.

> Hmmm, you're right. I did indeed forget to add "on update cascade" and
> "on
> delete cascade" to the "references" clauses in my original message (the
> first one I posted in this discussion, four days ago).

Better add that to collect the 1000 bucks :)

>
>
>>> I don't know what definition of "redundant" you've been taught. I think
>>> that "Exceeding what is necessary or natural; superfluous." (*) does
>>> quite
>>> accurately describe the schema with both thingID and thingName, not my
>>> original schema!
>>
>> Well, writing a trigger to cascade out fido's name change to 'fifi' is
>> redundant if your model was such that it was unnecessary.
>
> Why write a trigger if you can use DRI?

Exactly! Therefore my statement about it being redundant holds true, yes?


>
> Best, Hugo

Have fun,

thirdrock
May 21, 2004, 10:22:42 PM
On Fri, 21 May 2004 23:27:31 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

> On Fri, 21 May 2004 18:05:20 +1000, thirdrock <iktaccounts at optusnet
> dot
> com dot au> wrote:
>
> (snip stuff I agree with)
>
>>> And this is exactly the reason why I'd never use XDb1 for serious work,
>>> unless I encounter a problem area where the advantages of allowing
>>> untyped
>>> data outweigh the disadvantages.
>>
>> It is said that if all you have is a hammer, everything looks like a
>> nail.
>> There are a range of applications that require more flexibility than
>> strongly typed languages provide. Often the language of choice used is
>> Perl, and those who use it take the work seriously. But is a language a
>> database, or a database a language? After all, SQL Server stores the
>> data
>> on disk as binary, whether or not it is integer or string, and it is the
>> database engine that reads/writes from/to the disk that enforces the
>> datatype. How is this different from using member functions/overloaded
>> operators in an object database?
>
> I'm afraid I don't see what point you're trying to make.

That there is a range of problems where the advantages of allowing
untyped data outweigh the disadvantages.

>
>


>>> is not defined for the character domain and would result in an error. I
>>> know all this, and I can rely on it - if and only if the database
>>> ensures
>>> that only numeric data will be accepted in the Age column and only
>>> character data in the Color column.
>>
>> See my comment above about operator overloading. Also, your Colour could
>> be an RGB value. Where are you then? In weakly typed languages/databases
>> you can accept black and 0 and white and 16777216 as Colours.
>
> In a relational database, I can define Colour with a datatype that
> accepts
> RGB values instead of english names.

Yes, but CURRENTLY you cannot accept BOTH. The same is true of all
strongly typed languages, not just relational databases.

>
> But I (the developer) have to do this. I'll have to write the code for
> the
> overloaded = and + operators.

Someone always has to write the code to do this, be it the end developer
or the vendor of the database. Just because RDBMS exist where the code to
enforce the data type exist doesn't mean that some developer didn't write
the code to do that.


> Overloading gives the designer the possibility to create lots of extra
> possibilities for the user. But the user still can't use things that are
> not pre-defined by the designer. That's the point I was trying to make:
> XDb1's parser will gladly accept ages of 7, "over the top", "slightly
> older than I am" or even "I think therefore I am" without bothering to
> check if it has any overloaded operators that are able to do anything
> useful with that data.
>

Well, is that correct? If I can subclass or extend the functionality of
the XDb1 functions (which by the way I don't know whether or not it is
possible), then I can add that type checking myself.
There are C++ bindings to the XDb1 API ...

Or can I ?

thirdrock
May 21, 2004, 10:57:46 PM
On Tue, 18 May 2004 22:46:30 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:


>
> Note also that MS SQL Server, the RDBMS I do most work on at the moment,
> supports a datatype "sql_variant", that will store basically everything.
> I
> have never used this datatype, nor have I ever seen a question in any SQL
> Server related newsgroup (and I participate in quite a few of them) where
> this datatype would be the answer. And I expect I never will.
>
I wouldn't use it either, but not for the same reason. Every variant I
have ever used from Microsoft has been a baroque piece of shit.
There are third party implementations of variants for C++ that are
extremely useful. And of course you have unions that allow you to use a
single structure for any type of data (something that MS uses inside SQL
Server).
I suppose at the 4GL level there are very few applications of variants,
but at the lower level, building RDBMSs for example, they are totally
necessary.

Cheers,

Neo
May 22, 2004, 12:55:39 AM
> You obviously didn't check my model too well. If you really want to use a
>> tree-like notation to show how my model stores things (which can only be
> achieved by stretching things very far, since I didn't use a tree-like
> model to begin with)...

You stored the following data in a table named things:
god
army
trinity
john
mary
luke
fido
laptop1

The relationship between a table and its records is that of class and
its instances. The provided table above is equivalent to:

god isa thing.
army isa thing.
trinity isa thing.
john isa thing.
mary isa thing.
luke isa thing.
fido isa thing.
laptop1 isa thing.

which doesn't match the original structure. Thus the provided
solution has no way to determine that john is a person and that fido
is a dog. Only that they are both things, which does not allow one to
print differentiating class info on the report if both were named the
same. It's not a big problem; I know you could fix it.

>[thing] -- Not stored. Since everything is a thing by definition,
> -- there's no need to explicitly store that information.
> -- Some would even call it redundant.

But you did store it. It is stored as the name of the table. Being
explicit is not equivalent to being redundant.

> [thing]

> god -- Leaf level, therefore it's a thing.
> person -- Not leaf level, therefore it's a class.
> john -- Leaf level, therefore it's a thing.
> mary -- Leaf level, therefore it's a thing.
> dog -- Not leaf level, therefore it's a class.
> fido -- Leaf level, therefore it's a thing.
>
> As you can see, apart from removing the redundant storing that everything
> is a thing, there is no difference.

There is a difference. In the diagram above, the indentation of john
and mary under person indicates john and mary are persons, and the
indentation of person under thing indicates person isa thing. These
relationships can't be derived from your db schema. In XDb1's db, 'john
isa thing' is not stored because it can be derived. Since 'isa' is
transitive, 'john isa thing' is derived from 'john isa person' and
'person isa thing'. While it is true that everything isa thing, it is
also true that only some things are persons and only some things are
dogs. The provided db schema doesn't allow determination of the latter.
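The transitive derivation Neo appeals to ('john isa thing' follows from 'john isa person' and 'person isa thing') can be sketched in a few lines of Python; the dictionary of stored facts is illustrative, not XDb1's actual storage:

```python
# Stored 'isa' facts: only the direct edges, never the derivable ones.
isa = {
    "john": "person",
    "mary": "person",
    "fido": "dog",
    "person": "thing",
    "dog": "thing",
}

def is_a(x, cls):
    """Follow the stored 'isa' edges upward until cls is found or the chain ends."""
    while x in isa:
        x = isa[x]
        if x == cls:
            return True
    return False

print(is_a("john", "thing"), is_a("john", "dog"))  # True False
```

Note how 'john isa thing' is never stored yet is still derivable, while the schema still records that only some things are persons and only some are dogs.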

> >Thus, if a person is named 'skippy' and a dog is also named 'skippy',
> >and we print a thing's class to distinguish them on the report, XDb1
> >would print:
> > person/skippy
> > dog/skippy
>
> No, it wouldn't.

By 'would', I did not mean to imply that it currently does, but that
it could. The point being that XDb1's db has the proper schema/data to
allow an application to print the distinguishing class info. In the
provided schema/data, currently that is not possible.

> >whereas the provided implementation would print:
> > thing/skippy
> > thing/skippy
>
> No it wouldn't. It returns an error:

Yes, I realize the provided solution doesn't allow two things with the
same name. I was just anticipating the next problem due to its
non-generic design.

Hugo Kornelis

unread,
May 22, 2004, 8:09:23 AM5/22/04
to
On Sat, 22 May 2004 12:22:42 +1000, thirdrock <iktaccounts at optusnet dot
com dot au> wrote:

Probably. I didn't encounter these problems yet, but I'm well aware that
my personal experience alone proves nothing.

(snip stuff I agree with)

>> Overloading gives the designer the possibility to create lots of extra
>> possibilities for the user. But the user still can't use things that are
>> not pre-defined by the designer. That's the point I was trying to make:
>> XDb1's parser will gladly accept ages of 7, "over the top", "slightly
>> older than I am" or even "I think therefore I am" without bothering to
>> check if it has any overloaded operators that are able to do anything
>> useful with that data.
>>
>Well, is that correct? If I can subclass or extend the functionality of
>the XDb1 functions (which by the way I don't know whether or not it is
>possible), then I can add that type checking myself.
>There are C++ bindings to the XDb1 API ...
>
>Or can I ?

Frankly, I don't know. Neo should know - after all, he is XDb1's lead
developer.

Hugo Kornelis

unread,
May 22, 2004, 8:20:47 AM5/22/04
to
On Sat, 22 May 2004 12:12:34 +1000, thirdrock <iktaccounts at optusnet dot
com dot au> wrote:

(big snip)


>>>> This example of a bad design has redundant storage of the same
>>>> relation.
>>>
>>> Er.. actually it's an example of non-normalised data. I don't think
>>> design
>>> has anything to do with it.
>>
>> Huh? Normalizing *IS* one of the steps you take when you design the
>> schema
>> of a relational database. And the aim of normalizing is to remove
>> redundant relationships (not relations) from the design.
>
>True. However, sometimes your design will include data that is not
>normalised for reasons of avoiding unneccessary overhead or to produce
>certain types of behaviour (which I won't go into in this post).
>Therefore, non-normalised data *can* be poor design, but it not
>neccessarily so. It really depends on the context.
>My point is (and I do have one) that the generalisation that
>non-normalised data is bad design is not always true, and in fact is
>probably the wrong way around. Good relational design will, as a general
>rule, produce normalised data, but the inverse is probably not generally
>true.

Yes, you are correct. There may be very valid reasons for violating
normalization rules in the design of a relational database. A textbook I
once read made a distinction between DEnormalized design (where the data is
normalized first, then denormalized if there is a valid reason to do so)
and UNnormalized design (where the designer never bothered to normalize):
the UNnormalized design might even be exactly the same as the DEnormalized
design, but the former only got it right "by accident".

This is of course a generalization (really good designers might be able to
produce the correct design directly, without writing out the normalized
design first), but it does show that violating normalization is not wrong
in itself, as long as there is a good reason to do so. Never without
thinking it over and consciously making the decision to denormalize.


>>>> You seem to be thinking that storing the character strings 'fido' and
>>>> 'leader' several times is bad.
>>>
>>> Bad, or just not normalised?
>>
>> If I understand Neo correctly, he thinks it is not fully normalised, AND
>> he thinks that anything that is not fully normalised is bad.
>
>Which you don't, I assume, so we are in agreement.

I don't. We are.


>> Hmmm, you're right. I did indeed forget to add "on update cascade" and
>> "on
>> delete cascade" to the "references" clauses in my original message (the
>> first one I posted in this discussion, four days ago).
>
>Better add that to collect the 1000 bucks :)
>
>>
>>
>>>> I don't know what definition of "redundant" you've been taught. I think
>>>> that "Exceeding what is necessary or natural; superfluous." (*) does
>>>> quite
>>>> accurately describe the schema with both thingID and thingName, not my
>>>> original schema!
>>>
>>> Well, writing a trigger to cascade out fido's name change to 'fifi' is
>>> redundant if your model was such that it was un-neccessary.
>>
>> Why write a trigger if you can use DRI?
>
>Exactly! Therefore my statement about it being redundant holds true, yes?

Writing a trigger if DRI actions can be used is redundant. If, for
whatever reason, DRI actions can't be used writing a trigger makes perfect
sense.
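For what it's worth, the DRI point can be demonstrated concretely. A minimal sketch (SQLite via Python rather than the SQL Server setup discussed in the thread; the schema is an illustrative assumption that mirrors the name-as-key design): with ON UPDATE CASCADE declared, renaming a thing propagates with no trigger at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
con.executescript("""
CREATE TABLE things (thing TEXT PRIMARY KEY);
CREATE TABLE hierarchies (child TEXT NOT NULL
                              REFERENCES things ON UPDATE CASCADE ON DELETE CASCADE,
                          parent TEXT NOT NULL
                              REFERENCES things ON UPDATE CASCADE ON DELETE CASCADE);
INSERT INTO things VALUES ('john'), ('army');
INSERT INTO hierarchies VALUES ('john', 'army');
""")

# Renaming the thing cascades to every referencing row; no trigger needed.
con.execute("UPDATE things SET thing = 'johnny' WHERE thing = 'john'")
result = con.execute("SELECT child, parent FROM hierarchies").fetchall()
print(result)
```

After the UPDATE, the hierarchies row reads ('johnny', 'army'), which is exactly the behavior the "references ... on update cascade" clauses were meant to supply.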

Hugo Kornelis

unread,
May 22, 2004, 9:19:40 AM5/22/04
to
On 21 May 2004 21:55:39 -0700, Neo wrote:

>> You obviously didn't check my model too well. If you really want to use a
>> tree-like notation to show how my model stores thing (which can only be
>> achieved by stretching things very far, since I didn't use a tree-like
>> model to begin with)...
>
>You stored the following data in a table named things:
> god
> army
> trinity
> john
> mary
> luke
> fido
> laptop1

Of course I did. That's how the relational model works.


>The relationship between a table and its records is that of class and
>it's instances.

You're right, but that's the internals of a database. What matters is the
semantics of my implementation, not the internals. I'm well aware that
confusion may arise from the fact that the word "class" in your quote
above is the same as the word "class" in messages about the semantics of
the challenge, yet they are completely distinct things.

I've worked on projects that required me to switch from the problem level
to the meta-level (where you can have a table named columns and a column
named tablename) and even to the meta-meta-level (where one of the
values in the column tablename is column). Gives you major headaches in
the beginning, but you get used to it - the headaches get less, though
they never disappear completely <g>.


> The provided table above is equivalent to:
>
> god isa thing.
> army isa thing.
> trinity isa thing.
> john isa thing.
> mary isa thing.
> luke isa thing.
> fido isa thing.
> laptop1 isa thing.
>
>which doesn't match the original structure.

Of course not. It's a relational structure, the original structure was an
XDb1 kind of structure. But the structure is not relevant, the semantics
are. And that's where we're getting with your next sentence:

> Thus the provided
>solution, has no way to determine that john is a person and that fido
>is a dog.

SELECT things.thing, types.type
FROM things
LEFT OUTER JOIN types
ON types.thing = things.thing

Wow! I never thought that I would have to teach XDb1's lead developer how
to write a join.

(When I chose the names "types" and "type" for that table and column, I
didn't yet know that you thought of it as classes. If I were to rewrite my
implementation now, I'd choose to call the table classes and the column
class).


> Only that they are both things, which does not allow one to
>print differentiating class info on the report if both were named the
>same.

Two things can't be called the same. Why do I need to repeat this over and
over again?

> Its not a big problem, I know you could fix it.

It's not a problem. It doesn't need fixing.


>>[thing] -- Not stored. Since everything is a thing by definition,
>> -- there's no need to explicitly store that information.
>> -- Some would even call it redundant.
>
>But you did store it. It is stored as the name of the table.

The table name is part of the intension, not part of the extension.

>Being
>explict is not equivalent to being redundant.

In this case it is.


>> [thing]
>> god -- Leaf level, therefore it's a thing.
>> person -- Not leaf level, therefore it's a class.
>> john -- Leaf level, therefore it's a thing.
>> mary -- Leaf level, therefore it's a thing.
>> dog -- Not leaf level, therefore it's a class.
>> fido -- Leaf level, therefore it's a thing.
>>
>> As you can see, apart from removing the redundant storing that everything
>> is a thing, there is no difference.
>
>There is a difference. In the diagram above, the indentation of john
>and mary under person indicates john and mary are persons, and the
>indentation of person under thing indicates person isa thing. These
>relationship can't be derived from your db schema.

It can, by joining the types table to the things table. Even though SQL is
not the best tool for formatting, I can even produce the indented report
from the data in my tables:

select case
when thing is null then type
else case
when type is not null then ' '
else ''
end + coalesce(thing, '')
end
from
(select types.type, things.thing
from things
left join types on types.thing = things.thing
union
select types.type, null
from types) as derivedtable
order by type, thing
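For readers without SQL Server at hand, here is roughly the same indented-report query rewritten for SQLite (run via Python); '+' becomes '||', and the schema and sample rows are a minimal stand-in of my own, so treat this as a sketch rather than the exact implementation:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE things (thing TEXT PRIMARY KEY);
CREATE TABLE types (thing TEXT PRIMARY KEY REFERENCES things, type TEXT NOT NULL);
INSERT INTO things VALUES ('god'), ('john'), ('mary'), ('fido');
INSERT INTO types VALUES ('john', 'person'), ('mary', 'person'), ('fido', 'dog');
""")

# Union in one header row per type (thing = NULL), then indent the
# typed things beneath their header.
report = [row[0] for row in con.execute("""
SELECT CASE WHEN thing IS NULL THEN type
            ELSE CASE WHEN type IS NOT NULL THEN '  ' ELSE '' END || thing
       END
FROM (SELECT types.type, things.thing
      FROM things LEFT JOIN types ON types.thing = things.thing
      UNION
      SELECT types.type, NULL FROM types) AS derivedtable
ORDER BY type, thing
""")]
for line in report:
    print(line)
```

With these rows the output is the untyped 'god' first (NULLs sort first in SQLite), then each class header followed by its indented members.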


> In XDb1's db, 'john
>isa thing' is not stored because it can be derived. Since 'isa' is
>transitive 'john isa thing' is derived form 'john isa person' and
>'person isa thing'.

In my database, 'john isa thing' is not stored either, because it can be
derived. Everything is a thing.


> While it is true that everything isa thing, it is
>also true that only some things are persons and only some things are
>dogs. The provided db schema doesn't allow determination of the later.

See above.


>> >Thus, if a person is named 'skippy' and a dog is also named 'skippy',
>> >and we print a thing's class to distinguish them on the report, XDb1
>> >would print:
>> > person/skippy
>> > dog/skippy
>>
>> No, it wouldn't.
>
>By 'would', I did not mean to imply that it currently does, but that
>it could. The point being that XDb1's db has the proper schema/data to
>allow an application to print the distinguishing class info. In the
>provided schema/data, currently that is not possible.

To summarize:
* At present, my relational model does not allow two things to have the
same name; the XDb1 model does allow this, but you have to use an
undocumented "feature" to enter your data and your reports are instantly
rendered worthless.
* If needed, my relational model can easily be adapted to allow two or
more things to have the same name and enhance the reports so as to still
be meaningful, whereas users of XDb1 will have to wait until XDb1's
developers get around to changing this and are stuck in the meantime.

This once again proves that my relational implementation is indeed not "as
flexible" as the XDb1 implementation, but *more* flexible.

If the original message had indicated that different things might have the
same name, I would of course have incorporated that in my design. But
seeing how XDb1 fails to address this issue correctly, I can see why you
did not include that in your original post.


>> >whereas the provided implementation would print:
>> > thing/skippy
>> > thing/skippy
>>
>> No it wouldn't. It returns an error:
>
>Yes, I realize the provided solution doesn't allow two things with the
>same name. I was just anticipating the next problem due to its
>non-generic design.

It's actually quite easy to add the class to the report, if that's what
you want - just left join the types table. But your original post and the
web page mentioned therein didn't ask for inclusion of the class on the
report.

This has of course nothing to do with allowing or disallowing things with
the same name. I don't know why you're trying to create a connection here.

Neo

unread,
May 17, 2004, 11:44:27 AM5/17/04
to
> Relational may not do everything well, just an awful lot well.

I agree with your point that presently the relational model is
probably the best overall solution to many common applications.

> The real world is not tree-shaped for the most part.

I agree with your point and XDb1 is not based on lists, tables or
trees; however, it can represent them. XDb1 is a limited implementation
of an alternate data model.

Neo

unread,
May 22, 2004, 2:48:57 PM5/22/04
to
> > change 'john' to 'johnny' ... cant because of constraints...

>
> I had forgotten to include DRI actions when I posted the
> script for my implementation. Everywhere I wrote "references things", I
> should have written "references things on update cascade on delete
> cascade". I'm perfectly willing to supply the code for the trigger
> if necessary to convince you.

No need to supply the code. I accept that the proper DRI would change
'john' to 'johnny' in all remaining tables that used 'john' and I
accept that the chance of future programmers forgetting to add the
necessary DRIs for new related tables is very low.

> >Using data itself to link redundant data is a non-generic solution.
>
> Huh? Please elaborate.

Suppose your things table allows two things to have the same name, ie
'john' and 'john' (to approximate the genericness of the db created
with XDb1). In things table, each 'john' is a different record (and
hopefully there are enough other attributes to distinguish them; if
not, an id column might be added).

Now suppose you have a 2nd table called hierarchies and wish to link
its parent column with rows from the things table. In this case, the data
(ie 'john') could not be used to perform the link, because it would be
ambiguous. Thus, using data itself to link redundant data is a
non-generic solution.
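The point can be illustrated with a sketch (SQLite via Python; this schema is my own assumption, not the provided solution): once a system-generated id is the key, two things both named 'john' can be linked unambiguously.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE things (id INTEGER PRIMARY KEY, name TEXT);  -- name is no longer the key
CREATE TABLE hierarchies (child INTEGER REFERENCES things(id),
                          parent INTEGER REFERENCES things(id));
INSERT INTO things (id, name) VALUES (1, 'john'), (2, 'john'), (3, 'army');
INSERT INTO hierarchies VALUES (1, 3), (2, 3);  -- both johns, unambiguously linked
""")

# Each 'john' resolves by id, not by name, so the duplicate names cause
# no ambiguity in the join.
rows = con.execute("""
SELECT c.id, c.name, p.name
FROM hierarchies h
JOIN things c ON c.id = h.child
JOIN things p ON p.id = h.parent
ORDER BY c.id
""").fetchall()
print(rows)
```

This is essentially the "system-generated ID" alternative Hugo mentions elsewhere in the thread; whether it is worth the extra join is the design trade-off under debate.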

Hugo Kornelis

unread,
May 22, 2004, 3:43:34 PM5/22/04
to

Hi Neo,

If the requirements were such that two things might have the same name, my
design would have been different. If the requirements are that two things
can have the same name but only if they are of different classes, I could
have used thing+class as primary key for the things table; this would also
have introduced extra columns in related tables. Or I could have chosen to
use a system-generated ID instead. In a real application, such decisions
are made so as to get the best performance for typical usage patterns.

If the requirements had even allowed two things of the same class to share
the same name (e.g. two persons named 'john'), I would have asked for more
information, because an RDBMS requires a possibility to identify each
individual row in a table. (I don't know if XDb1 has such a requirement; I
hope it does, because there is not much use in storing data about things
if you have no way of knowing WHICH thing you're storing data about - if
there are two persons named john in my office, what use is the statement
'john is aged 35' to me?)

Neo

unread,
May 22, 2004, 4:50:44 PM5/22/04
to
> > Among other things, the difference in normalization between the
> > implementations is quite different.
>
> It should not matter HOW my implementation generates the report.

It matters because a non-generic/unnormalized implementation
is more likely to experience problems over a broader scope. Here is
another example, consider what happens when a thing has no name. Since
the provided solution specifically uses a thing's name to represent it
and link between tables, the provided solution will fail to represent
a non-named thing (ie a person whose name is unknown or doesn't have
one).

On the other hand, the XDb1 db's schema (created by the simple
english-like script) allows a thing to have 0 to many names. In XDb1's
db, the name is an attribute of a thing, the name isn't the thing
itself. The easiest way to check this functionality is to unname john
in a completed db. To unname john, click thing/person/john twice with
a small delay to enable edit mode. Clear out 'john' and press enter.
Note, when a thing has no name, XDb1 displays its secondary classes,
if any, and its properties, if any; thus after unnaming john, one will
see 35, his age. In addition, the report will run as before, except
'john' will be replaced with 'person 35', the unnamed thing's classes
and properties.

If you insist that unnaming a thing is not the same as creating an
unnamed thing in the first place, you can create the db with the
original script, except replace all instances of the word 'john' with
'*'. XDb1's NLI takes '* isa person' to mean, create an unnamed
instance of person class. In succeeding statements, '*' continues to
refer to the last unnamed thing created (XDb1's NLI is currently far
from complete and its GUI or API is required for many operations).

The main point I wanted to emphasize is not what the GUI has or hasn't
implemented, but the broader scope afforded by representing data with
high degree of flexibility and normalization (replacing redundant
things with a suitable ref to the original; things' names are not
suitable refs).

Neo

unread,
May 22, 2004, 6:40:42 PM5/22/04
to
> Now please tell me why a user who wants to name a dog 'john' would first
> enter its name as 'fido' and then change the name later?

You are technically correct. Here are the steps to enter a dog whose
name is 'john' right from the start of his instantiation.

god isa thing.
force isa thing.
army isa force.
church isa thing.
trinity isa church.


person isa thing.
john isa person.
mary isa person.

luke isa person.
age isa thing.
35 isa age.
john is 35.
weight isa thing.
130 isa weight.
mary is 130.
color isa thing.
red isa color.
luke is red.
dog isa thing.
* isa dog.
(Note: '*' tells XDb1 to create an unnamed instance and '*' can
be used in later statements to refer to the last unnamed instance.
To name the new instance, click on it twice with a small delay
to enter edit mode. Type in 'john' and press enter.
Note2: The version you have has a bug and does not process
"*'s name is john." correctly. It has been corrected in newer ver.)
computer isa thing.
laptop1 isa computer.
army leader god.
trinity leader god.
john leader army.
mary leader army.
mary leader trinity.
luke leader trinity.
laptop1 leader john.
laptop1 leader mary.
* leader john.
* leader mary.
* leader luke.

Currently XDb1's NLI is quite incomplete. For many operations, the user
must use the GUI or API. My goal isn't to demo the NLI/GUI's limited
strengths and many weaknesses, but to show that data can be represented
in a highly normalized and generic manner and that it can be processed
quickly, especially compared to when the equivalent data/schema is
spread out over several tables in RM.

Thus, while the db created in XDb1 allows things to have the same name
(right from the start) and the provided db doesn't, the latter is not as
generic a solution. But I know you could make it more generic so a
fairer comparison can be made.

Neo

unread,
May 22, 2004, 7:28:42 PM5/22/04
to
> > The report is as before except occurrences of 'fido' are
> > replaced with 'john'. In this case, printing each thing's class (ie
> > 'person/john', 'dog/john') would allow user to distinguish them.
>
> (giggle) not exactly, Neo. I got the following when I tried it...

> I especially like the line "john john john 1".

In general, there is roughly an app layer and a data layer. If the app
has to change or extend its logic in order to meet new conditions
(like adding a thing's class to the report), that in general is
acceptable to me. But what is not as acceptable is either a failure to
model the data correctly (ie the class hierarchy) and/or a schema that
is non-generic (ie two things can't have the same name, or inability to
handle unnamed things). The db implemented with XDb1 allows each thing
to have 0 to many names, and any two things can have the same name,
and the class hierarchy is explicit and can be of any depth.

After changing fido's name to john or creating a dog named john (the
same name as that of the 1st person), the report runs "correctly" in
the sense that it is the same as before except the word 'john' now
also appears where 'fido' was before. Thus the prior output 'john fido
john 1' became 'john john john 1'. Of course, now the problem is that
the user can't distinguish john the person from john the dog, until
the app adds that info. Previously, when I stated that XDb1 would
print 'person/john' and 'dog/john', I meant XDb1 would be able to do
it by altering the app layer but without adding additional data or
modifying the db schema, which is not the case with the provided solution,
since it does not model the class hierarchy correctly (it fails to
store the class of 1st-level things such as person, dog, and
computer, and it fails to allow deeper class hierarchies; see other
thread for details).

> Please indicate which part of that message explains the requirement
> that duplicate names should not be allowed on entry...

The part that says "Report from normalized and NULL-less data and the
solution must be as generic". For starters, a solution that can't
handle multiple things with the same name is not as generic. But I
know you could make your solution more generic to allow a fairer
comparison.

Nick Landsberg

unread,
May 22, 2004, 8:13:06 PM5/22/04
to

[ Everything Snipped ]

OK folks, I have found this discussion fascinating
and educational. However I would like to add
one more item into the equation, and it is not
theory (although I notice that c.d.theory is
in the distribution list) ---

Given the possible implementations, how long
would this query run on a 1 million record
database? (That's actually small by today's standards)
And then for a 10 million record database.

Pick your own hardware, e.g. raid arrays,
2.5 GHz CPUs etc., but tell me how long
this query would run, please.

Theory is nice, but, in my experience,
performance is what sells systems to
customers.

--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch

Neo

unread,
May 22, 2004, 8:31:56 PM5/22/04
to
> "(...) and the solution must be as generic, meaning allow the user to
> create ANY hierarchy, (...)" Now, I just tested what would happen if
> I tried to do the same kind of thing with the XDb1 solution.
> I entered "laptop1 leader2 trinity." and got the following error
> message: "Invalid relator."

The error message is correct in the sense that XDb1 does not know any
relator named "leader2". It needs to be "taught" new relators and one
can't teach it simply by using it in a sentence. Thus far you have
mostly seen 'X isa thing' or 'Y is red'. XDb1 knows these and a few
others because the db was "taught" these during its creation. User can
"teach" ANY relator, but currently it requires the API and some work
is required to make it user-friendly. Which relator would you like me
to add to the downloadable db at the website? Thus the provided
solution is more generic in relation to hierarchies. However,
currently the provided solution doesn't properly model the most
important hierarchy, the class hierarchy. In XDb1, classes can form a
hierarchy. This is not possible in the provided solution as the
hierarchies table is not used.

Neo

unread,
May 22, 2004, 8:54:14 PM5/22/04
to
> I saw that you wanted a GENERIC solution, that would allow
> ANY hierarchy. That's why I renamed the table "hierarchies", added the
> hierarchy column and added a parameter to the procedure.

Actually the provided solution is inconsistent in that the class
hierarchy of things, the most important one, isn't encoded in the
hierarchies table. It is instead encoded in a dedicated table named
types. The procedure that generates the report utilizes the hierarchies
table and not the types table. And if one were to encode the class
hierarchy in the hierarchies table also, it would be redundant and possibly
different.

Hugo Kornelis

unread,
May 23, 2004, 4:28:48 PM5/23/04
to
On 22 May 2004 17:31:56 -0700, Neo wrote:

>> "(...) and the solution must be as generic, meaning allow the user to
>> create ANY hierarchy, (...)" Now, I just tested what would happen if
>> I tried to do the same kind of thing with the XDb1 solution.
>> I entered "laptop1 leader2 trinity." and got the following error
>> message: "Invalid relator."
>
>The error message is correct in the sense that XDb1 does not know any
>relator named "leader2". It needs to be "taught" new relators and one
>can't teach it simply by using it in a sentence. Thus far you have
>mostly seen 'X isa thing' or 'Y is red'. XDb1 knows these and a few
>others because the db was "taught" these during its creation. User can
>"teach" ANY relator, but currently it requires the API and some work
>is required to make it user-friendly. Which relator would you like me
>to add to the downloadable db at the website?

No, don't bother. If I want to add a new relator, I'll just enter it to
the hierarchies table in my relational design. That's a lot easier than
going back to the database vendor every time I need a new hierarchy :-)


> Thus the provided
>solution is more generic in relation to hierarchies. However,
>currently the provided solution doesn't properly model the most
>important hierarchy, the class hierarchy. In XDb1, classes can form a
>hierarchy. This is not possible in the provided solution as the
>hierarchies table is not used.

Oh, so THAT is what you want! Why didn't you say so directly from the
start? Because, you see, your original post failed to point this out.

But since I'm in a good mood, I'll provide you with the necessary change
free of charge (note that I renamed the types table to classes; if you
didn't, change "references classes" to "references types"):

CREATE TABLE ClassHierarchy (SubClass varchar(20) not null
references classes
on update cascade
on delete cascade,
SuperClass varchar(20) not null
references classes
on update cascade
on delete cascade,
primary key(SubClass, SuperClass))

This is assuming that a class can be a subclass of multiple classes; if
not, change the primary key to cover only the SubClass column. I think you
can work out the INSERT statements for yourself.
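A hedged sketch of those INSERT statements and of the composite-key point (SQLite via Python, standalone and without the references clauses, purely for illustration; the class names are my own examples):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The composite key lets a class have more than one superclass; a key
# on SubClass alone would reject the second 'spaniel' row.
con.execute("""CREATE TABLE ClassHierarchy (
                   SubClass TEXT NOT NULL,
                   SuperClass TEXT NOT NULL,
                   PRIMARY KEY (SubClass, SuperClass))""")
con.executemany("INSERT INTO ClassHierarchy VALUES (?, ?)",
                [('person', 'thing'), ('dog', 'thing'),
                 ('spaniel', 'dog'), ('spaniel', 'pet')])

# Direct superclasses of 'spaniel' -- two rows, as the composite key allows.
rows = con.execute("""SELECT SuperClass FROM ClassHierarchy
                      WHERE SubClass = 'spaniel' ORDER BY SuperClass""").fetchall()
print(rows)
```

Dropping SuperClass from the primary key, as suggested, would turn the second spaniel INSERT into a constraint violation, enforcing single inheritance.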

Also, since neither the class of a thing, nor its classes' position in
the class hierarchy, is needed for generating the Common Ancestors Report,
you can expect performance for generating it to remain unchanged.

Hugo Kornelis

unread,
May 23, 2004, 4:36:04 PM5/23/04
to
On 22 May 2004 17:54:14 -0700, Neo wrote:

>> I saw that you wanted a GENERIC solution, that would allow
>> ANY hierarchy. That's why I renamed the table "hierarchies", added the
>> hierarchy column and added a parameter to the procedure.
>
>Actually the provided solution is inconsistent in that the class
>hierarchy of things, the most important one, isn't encoded in the
>hierarchies table. It is instead encoded in a dedicated table named
>types.

Yes, indeed. That's because I normalized the data (as requested in the
original message). Note: see my other message as well, where I provide an
additional table to keep track of the class hierarchy.


> The procedure that generates the report utilizes the hierarchies
>table and not the types table.

Of course. If you'd replace the statement 'john isa person' by 'john isa
dog' but leave all 'xxx leader yy' statements intact, the Common Ancestor
Report would remain unchanged. This report is based on the leader
hierarchies, disregarding the class of a thing. Why would I use a table in
my query if it doesn't have any effect on the requested output?

> And if one were to encoded the class
>hierarchy in hierarches table also, it would be redundant and possibly
>different.

I'm not sure if I understand what you try to say here.

Neo

unread,
May 23, 2004, 4:49:30 PM5/23/04
to
> OK folks, I have found this discussion fascinating and educational.
> Given the possible implementations, how long would this query run
> on a 1 million record database?

It could take hours, days, months, maybe even years, depending on the
number of interconnections between things (I believe the human brain
performs somewhat similar computations but in a massively parallel
manner)!

Sorry, but XDb1 is not suitable for commercial applications yet. It is
currently mostly an experimental db meant to test out an alternate
data model. IMO, an apple-to-apple comparison hasn't been made yet
because of the following differences:

The provided RM db's schema is not as generic as XDb1's in the
following respects:
1. It does not model a thing's name as an attribute.
2. It does not model attributes with multiple values in a normalized
manner.
3. It requires a separate table for values of each data type!!!
4. It does not model the class hierarchy in the appropriate table thus
making it inaccessible for reporting.
5. It does not model a way to normalize names of things. (ie
person/john and dog/john)
6. It does not model a way to normalize parts of a name (ie 'smith' in
'john smith' and 'mary smith')
7. It does not model a way to normalize symbols in names. (ie 'o' in
'john' and 'god')

Also the provided RM db's data is not as normalized as XDb1's. The RM db
has duplicate data which needs to be replaced with proper refs (keys)
to the one and original data within the db. A proper ref must be
independent of the data being represented, otherwise problems arise. For
example, in the RM db, two things can't have the same name and it can't
properly represent an unnamed thing.

Some differences between XDb1's and RM's reporting algorithms:
1. RM's report is more complete. It reports all of the nearest common
ancestors, whereas XDb1's only reports any one of the nearest common
ancestors.

2. RM's report algorithm doesn't access the attributes of any thing,
whereas XDb1's can. For instance, if a thing is unnamed, XDb1 prints
the thing's classes and attributes!!!

3. XDb1's algorithm is extremely memory efficient. The major variable
in calculating memory requirements are 3 integer arrays that must be
larger than the depth of the hierarchy, regardless of the number of
things in the hierarchy. Thus for a hierarchy 100 deep, XDb would need
300 integers for these arrays. RM's creates an intermediate table
whose size can be much much larger since there can be nearly an
infinite number of things in a 100-depth hierarchy!
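For comparison, the kind of computation being timed here can be sketched outside any DBMS. This is an illustrative breadth-first search of my own, not XDb1's actual algorithm or the SQL solution; it uses sum-of-distances as one plausible "nearest" metric, assumes the two things share at least one ancestor, and uses the thread's god/army/trinity hierarchy as data:

```python
from collections import deque

# parents: child -> list of parents (a thing may have several parents)
parents = {
    'army': ['god'], 'trinity': ['god'],
    'john': ['army'], 'mary': ['army', 'trinity'], 'luke': ['trinity'],
}

def ancestor_depths(node):
    """Map each ancestor (including node itself) to its minimum distance."""
    depths, queue = {node: 0}, deque([node])
    while queue:
        cur = queue.popleft()
        for p in parents.get(cur, []):
            if p not in depths:
                depths[p] = depths[cur] + 1
                queue.append(p)
    return depths

def nearest_common_ancestors(a, b):
    """All common ancestors minimizing total distance, plus that distance."""
    da, db = ancestor_depths(a), ancestor_depths(b)
    common = set(da) & set(db)
    best = min(da[c] + db[c] for c in common)
    return sorted(c for c in common if da[c] + db[c] == best), best

print(nearest_common_ancestors('john', 'luke'))
print(nearest_common_ancestors('john', 'mary'))
```

Memory here grows with the number of ancestors visited, not just the depth, which is one way to see why the per-implementation memory claims above are hard to compare directly.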

With a very small dataset, on an off-line, 500 MHz, Dell PowerEdge, 512
MB, Ultra SCSI II, 10,000 RPM HDs, NT 4 sp6a, SQL Server 7: RM's
solution takes 65 ms on avg, but as low as 30 ms and as high as 1186
ms. XDb1's simplified solution (only printing thing's ID) which very
roughly approximates RM's solution (prints char-based keys which
happen to encode thing's name also), takes as high as 2.23 ms. XDb1's
generic solution (which is normalized down to atomic symbols and
accesses the same symbol 'o' in the db when printing 'john' and 'god'
on the report and will print an unnamed thing's classes and
properties), takes 2 or 3 ms.

Summary of processing time for a small data set (8 things):
RM's simple solution: 65 ms (typical)
XDb1's simpler solution: 2.23 ms (max)
XDb1's generic solution: 2 or 3 ms (typical)

On a larger hierarchy consisting of 200 goats (5 generations x 40
goats/generation, each goat having two parents):
RM's simple solution: 11 sec (lowest of 11,16,15,15,14,11,13,14,14,13)
XDb1's generic solution: 5.438 sec (only 1 run thus far with beta
v4.5.0)

Please note: the above are apple-to-orange comparisons.

Hugo Kornelis

unread,
May 23, 2004, 6:15:11 PM5/23/04
to
On 22 May 2004 13:50:44 -0700, Neo wrote:

>> > Among other things, the difference in normalization between the
>> > implementations is quite different.
>>
>> It should not matter HOW my implementation generates the report.
>
>It matters because a non-generic/unnormalized implementation
>is more likely to experience problems over a broader scope. Here is
>another example, consider what happens when a thing has no name. Since
>the provided solution specifically uses a thing's name to represent it
>and link between tables, the provided solution will fail to represent
>a non-named thing (ie a person whose name is unknown or doesn't have
>one).

Hi Neo,

Yes, that's correct. If the challenge had included the requirement to
store unnamed things as well, I wouldn't have designed my tables the way
they are now. Note, however, that things without names tend to be a PITA
when you want to do anything useful with them, unless other attributes are
stored that can be used to identify exactly which thing one wants to discuss.


>On the other hand, the XDb1 db's schema (created by the simple
>english-like script) allows a thing to have 0 to many names. In XDb1's
>db, the name is an attribute of a thing, the name isn't the thing
>itself.

The same holds for a relational database. In my design, the name is also
an attribute (aka column). Since I declared it as primary key, it has to
disallow nulls. If the challenge had called for nameless things, I would
have used a different primary key (probably an autogenerated integer that
would never be exposed to the end users).


> The easiest way to check this functionality is to unname john
>in a completed db. To unname john, click thing/person/john twice with
>a small delay to enable edit mode. Clear out 'john' and press enter.
>Note, when a thing has no name, XDb1 displays its secondary classes,
>if any, and its properties, if any; thus after unnaming john, one will
>see 35, his age. In addition, the report will run as before, except
>'john' will be replaced with 'person 35', the unnamed thing's classes
>and properties.

I checked, and this does indeed work as advertised.

>If you insist that unnaming a thing is not the same as creating an
>unnamed thing in the first place, you can create the db with the
>original script, except replace all instances of the word 'john' with
>'*'. XDb1's NLI takes '* isa person' to mean, create an unnamed
>instance of person class. In succeeding statements, '*' continues to
>refer to the last unnamed thing created (XDb1's NLI is currently far
>from complete and its GUI or API is required for many operations).

This incompleteness shows if you try to enter an unnamed person, an
unnamed dog and an unnamed computer. Changing all 'john', 'fido' and
'laptop1' to '*' in the script is a good way to mess things up big time.
While multiple unnamed things are indeed allowed, it becomes impossible to
explain to XDb1 which of the unnamed things I'm referring to. As far as
I'm concerned, this shows why it's generally better to insist on naming
things.

Of course, you already pointed out that the NLI is incomplete. I'm sure
that you'll find a way to improve this. But please do note that I can
change my SQL Server database to allow nameless things without having to
contact the database vendor, whereas an XDb1 user is stuck with the current
functionality until you find the time to fix this.

Neither implementation provides full support for things without a name,
things sharing a name, or things having multiple names. All these points
can be addressed in my implementation by adding and/or changing some
tables and/or columns, with no need to contact the vendor. All these
points can only be changed by the vendor in XDb1's implementation. I'd say
that, unless you happen to *be* the vendor, my implementation should be
considered more generic.


>The main point I wanted to emphasize is not what the GUI has or hasn't
>implemented, but the broader scope afforded by representing data with
>high degree of flexibility and normalization (replacing redundant
>things with a suitable ref to the original; things' names are not
>suitable refs).

**IF** the problems' requirements are such that one might expect the need
to store unnamed things, then the name should not be considered as
possible primary key. But since that requirement was not clear from your
original $1000 challenge post, it doesn't invalidate my claim.

Nick Landsberg

May 23, 2004, 6:31:17 PM
Neo wrote:

>>OK folks, I have found this discussion fascinating and educational.
>>Given the possible implementations, how long would this query run
>>on a 1 million record database?
>
>
> It could take hours, days, months, maybe even years, depending on the
> number of interconnections between things (I believe the human brain
> performs somewhat similar computations but in a massively parallel
> manner)!
>
> Sorry, but XDb1 is not suitable for commercial applications yet. It is
> currently mostly an experimental db meant to test out an alternate

> data model. ...

Thank you for the reply, Neo, and especially for the bit of
unadorned honesty in the above sentence :)

> ... IMO, an apple-to-apple comparison hasn't been made yet


> because of the following differences:
>

[ Much Snippage ]

>
> Summary of processing time for a small data set (8 things):
> RM's simple solution: 65 ms (typical)
> XDb1's simpler solution: 2.23 ms (max)
> XDb1's generic solution: 2 or 3 ms (typical)
>
> On a larger hierarchy consisting of 200 goats (5 generations x 40
> goats/generation, each goat having two parents):
> RM's simple solution: 11 sec (lowest of 11,16,15,15,14,11,13,14,14,13)
> XDb1's generic solution: 5.438 sec (only 1 run thus far with beta
> v4.5.0)
>
> Please note: the above are apples-to-oranges comparisons.

Yes they are, but they give me a feel for the
relative orders of magnitude. (It was good of you
to provide detailed specs on the machine in your
reply. And it was also good of you to run the
tests. I did not mean to impose.)

As a "less-than-first-order" approximation (zero-th approximation?)
it would seem that XDb1 scales better to larger datasets,
but with only two data points it is hardly more than
conjecture at this time.

What I would be more curious about is along the lines
of the following (bear with me please) -

Many users of databases want to have their cake and
eat it too. That is, they want to have speed of
access for simple (single-record) retrieve operations
and at the same time, have access to all the data
for complex queries. In other words, have the speed
for transaction processing while having the flexibility
for reports. So, my next question is:

How does this critter perform on a simple query which
extracts just the record for "John Smith" which may happen
to be at a leaf node. (I hope I have the terminology
right.)

No need to run the tests any time soon. I'm just curious.
I think this critter has possibilities and may be
of commercial interest several years down the road.

(Note that the Relational folks had to go through
the same trial-by-fire some 20+ years ago. The
first implementations were nowhere near as fast as
the then-current CODASYL DBMS's, but we learned
how to optimize and do strategic denormalization
as we went along. This sounds like it needs to
go through the same kind of learning curve.)

Nick L.

Hugo Kornelis

May 23, 2004, 6:51:05 PM
On 22 May 2004 15:40:42 -0700, Neo wrote:

>> Now please tell me why a user who wants to name a dog 'john' would first
>> enter its name as 'fido' and then change the name later?
>
>You are technically correct. Here are the steps to enter a dog whose
>name is 'john' right from the start of his instantiation.
>

(snip)


>
>Currently XDb1's NLI is quite incomplete. For many operations, user
>must use GUI or API.

Hi Neo,

Though this works, it is still far from intuitive. How would an average
end user know that there already is a thing named 'john' in the database
when (s)he is required to enter dog john's data? This would force the end
user to always check if the name is already in use.

The '*' for unnamed notation also introduces other problems which I point
out in another message.

> My goal isn't to demo NLI/GUI's limited strengths
>and many weakness, but that data can be represented in a highly
>normalized and generic manner and that it can be processed quickly
>especially when the equivalent data/schema is spread out over several
>tables in RM.

I think I'd better not comment on this....


>Thus, while the db created in XDb1 allows things to have the same name
>(right from the start) and the provided db doesn't, it is not as
>generic of a solution.

Like I said in another message, neither XDb1's implementation nor my
implementation for SQL Server provide full support for this. This support
can be added to my implementation by everyone with basic table design
skills and some SQL experience, without a need to contact the vendor; for
the XDb1 implementation full support for things with the same name can
only be provided by the vendor. This makes XDb1 the more flexible and
generic implementation for Neo and my db the more flexible and generic
implementation for everybody else.


> But I know you could make it more generic so a
>fairer comparison can be made.

I can indeed. In another message I posted earlier today, I already
described a possible way to do this. If you have an urge to compare XDb1
to SQL Server, by all means use my suggestions to change the solution I
came up with (which probably is *not* the fastest solution possible).

As far as I'm concerned, this discussion is not about comparing products,
but about a $1000 challenge that you put up. I gave you an implementation
that met all requirements that you have set in your original message. I am
not telepathic, so I probably have missed one or two requirements that you
didn't include in your message. Since telepathy was not a requirement
either, I consider all your remarks about things with no names, things
with multiple names, multiple things with one name and whatever you'll
come up with next as interesting stuff for a discussion, but irrelevant
to the challenge.

You owe me $1000, Neo.

Hugo Kornelis

May 23, 2004, 7:37:22 PM
On 22 May 2004 16:28:42 -0700, Neo wrote:

>> > The report is as before except occurrences of 'fido' are
>> > replaced with 'john'. In this case, printing each thing's class (ie
>> > 'person/john', 'dog/john') would allow user to distinguish them.
>>
>> (giggle) not exactly, Neo. I got the following when I tried it...
>> I especially like the line "john john john 1".
>
>In general, there is roughly an app layer and a data layer. If the app
>has to change or extend its logic in order to meet new conditions
>(like adding a thing's class to the report, that in general is
>acceptable to me. But what is not as acceptable is either a failure to
>model the data correctly (ie the class hierarchy) and/or a schema that
>is non-generic (ie. two things can't have same name or inability to
>handle unnamed things). The db implemented with XDb1 allows each thing
>to have 0 to many names, and any two things can have the same name,
>and the class hierarchy is explicit and can be of any depth.

Hi Neo,

I think that being the developer of a database gives you a slightly
different view on things than other people have. For you, changing the
application might be easy. But what if I had bought XDb1 and actually
*needed* this report - including the need to distinguish john the dog from
john the person? The only way to accomplish this would be to contact you
and ask if you could please make a patch to enable this for me. Well,
maybe you and your partners are a brilliant and very service-minded group.
Maybe you'd spend the whole night at your computers and send me the patch
first thing in the morning. But then again - maybe not. Maybe you're
having a day off. Maybe you have a policy against sending out patches.
Maybe you are all busy for another customer first. There is no way that I
(or my boss) can direct you to change the application *NOW*. And that's
the kind of dependency that puts many decision makers off.

On the other hand, my implementation can be changed by almost any DBA or
SQL coder. Add some tables or columns, convert the data, change the
procedure, test it ... and it's already done. If we're in a hurry, my boss
can tell ME to drop everything else and solve this first. He can order me
some food and have me phone my wife that I'll be working late again. He
can rely on me (or another DBA/SQL coder) to have the change done in a
short time. That's the kind of flexibility that bosses tend to like.


>After changing fido's name to john or creating a dog named john (the
>same name as that of the 1st person), the report runs "correctly" in
>the sense that it is the same as before except the word 'john' now
>also appears where 'fido' was before. Thus the prior output 'john fido
>john 1' became 'john john john 1'. Of course, now the problem is that
>the user can't distinguish john the person from john the dog, until
>the app adds that info. Previously, when I stated that XDb1 would
>print 'person/john' and 'dog/john', I meant XDb1 would be able to do
>it by altering the app layer but without adding additional data or
>modifying db schema

which would require placing a service request by the database vendor

>, which is not the case with the provided solution,
>since it does not model the class hierarchy correctly (it fails to
>explicitly store the class of 1st-level things such as person, dog,
>computer and it fails to allow deeper class hierarchies. See other
>thread for details).

I already provided an extra table to address the possibility of class
nesting (which incidentally was not clear from your original post).


>> Please indicate which part of that message explains the requirement
>> that duplicate names should not be allowed on entry...
>
>The part that says "Report from normalized and NULL-less data and the
>solution must be as generic". For starters, a solution that can't
>handle multiple things with the same name is not as generic.

1. Your quote is incomplete. In the original message, the requirement
"must be as generic" was defined as:

(quote original message)
"(...) must be as generic, meaning allow the user to create
any hierarchy, consisting of different types of things
(each type to allow different attributes) and each thing
in the hierarchy to have any number of parents."

My implementation does allow the user to create any hierarchy (which XDb1
doesn't), it does allow different types (classes) of things in each
hierarchy, it does allow each type and even each thing to have its own
attributes, and each thing may have any number of parents. I met all the
requirements you stated. You didn't state anything about things sharing
the same name.

2. Now let's suppose that I had somehow guessed that the above quote was
not intended as the definition of "generic" for this challenge. In that
case, the phrase "must be as generic" seems to miss something: it must be
as generic as what? I think this can only be taken to mean "as generic as
described in this message and the related web page". There's nothing in
the message or on the web page to indicate that multiple things might have
the same name. In fact, both the sample data and the NLI used to describe
the challenge seem to indicate that a thing's name is unique within the
universe of discourse.

3. Let's take it one step further. Let's pretend that I had somehow
guessed that you meant "as generic as XDb1". Then you would have set an
impossible task, since there's nobody who knows exactly how generic XDb1
is except you - and you didn't tell us in the original message! You didn't
seriously expect me to go disassemble, decompile and reverse engineer
XDb1, now did you? But in fact, that would have been the only way to know
how generic XDb1 is! Without either me disassembling the program or you
telling me how generic it is, how can I know what XDb1 can or can't do?
Heck, maybe you've even put foreign language support in it and you're
going to dismiss my entry for not understanding "god istein ding." or
"john estun person."

But of course you wouldn't want all takers of your challenge to go and
reverse engineer XDb1. So I immediately knew that you didn't mean "as
generic as XDb1", since if you had, you'd surely have provided full
details on how generic XDb1 is. Which brings me back to point 1, the full
quote, which I at that time took for a description of how generic XDb1 is.


> But I
>know you could make your solution more generic to allow a fairer
>comparison.

I could, but I won't. Not now. I don't want to compare, I entered a
challenge and I want to collect the prize.

Hugo Kornelis

May 23, 2004, 8:09:47 PM
On Sun, 23 May 2004 00:13:06 GMT, Nick Landsberg wrote:

>
>[ Everything Snipped ]
>
>OK folks, I have found this discussion fascinating
>and educational. However I would like to add
>one more item into the equation, and it is not
>theory (although I notice that c.d.theory is
>in the distribution list) ---
>
>Given the possible implementations, how long
>would this query run on a 1 million record
>database? (That's actually small by today's standards)
>And then for a 10 million record database.
>
>Pick your own hardware, e.g. raid arrays,
>2.5 GHZ CPU's etc., but tell me how long
>this query would run, please.
>
>Theory is nice, but, in my experience,
>performance is what sells systems to
>customers.

Hi Nick,

Fair question. In between answering Neo's latest messages, I was
constantly switching to my Query Analyzer window. I've been busy writing
out a script that will generate millions of rows in the things and add
random dependencies in the "leader" hierarchy, so I could test some
things.

Some disclaimers to start with:
1. I think I'm pretty good at designing tables and writing SQL. But I have
next to no experience in tuning heavy-duty systems. I can (and
probably will) tweak some things here and there and gain some performance,
but it's not what I do best.
2. I don't have the money to spare on the kind of hardware that would be
present in a shop that runs this kind of queries on 1 million record
databases. I'll do some tests with large numbers, but all I can offer is a
Desktop PC, powered with one 1.3GHz AMD Athlon, 256 MB of RAM and equipped
with two 40GB 7200rpm IDE harddisks (I have the log on one of the HDs and
the data on the other - for this kind of queries, I'd sure like to plant
tempdb on a third HD).

With only this computer at my disposal, I'll have to trim down the number
of rows considerably. It's not that my hardware can't handle a table with
1 million (or even 10 million) rows - but while waiting for the results of
my preliminary tests (using "only" roughly 0.5 million things), it
suddenly occurred to me that the report would be based on the Cartesian
product of the things table with itself - almost 250,000,000,000 rows for
my preliminary test, more than 121,000,000,000,000 rows if I would run the
complete test set I prepared (sporting over 11 million things). Just
sending the report to the screen would probably take days!! And I'd need
to add some really beefy hardware to store it.
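The arithmetic behind those row counts checks out; a quick sanity-check sketch (the two row counts here are rounded from the figures given above):

```python
# Cartesian product of the things table with itself: n * n rows
preliminary = 500_000          # "roughly 0.5 million things"
full_set = 11_000_000          # "over 11 million things"

print(preliminary ** 2)   # 250_000_000_000 -> "almost 250,000,000,000 rows"
print(full_set ** 2)      # 121_000_000_000_000 -> "more than 121,000,000,000,000 rows"
```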

I just saw that it's past 2AM already, so I'll have to get back to you
later with test results from a "slightly" smaller test set.

Two final notes:
1. I did already catch Neo's reply to you. I'll still try to run the tests
I intended, even though Neo probably won't. Just out of curiosity.
2. I expect my implementation to be quite slow when dealing with large
numbers of rows. In fact, I was quite surprised when my first attempt
already managed to beat XDb1's execution time. Some time ago, I saw some
messages by Joe Celko in another newsgroup about an alternate way to store
trees. I don't recall the details, but from what I still know, I expect it
to be lots (and lots and lots) faster.
After finishing the tests on my own implementation, I might go try and
find his method, see what performance that one will yield. Just out of
curiosity.
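The Celko technique alluded to here is most likely his nested set model, in which each node stores a left/right interval and ancestor lookups become range comparisons instead of iterative joins. A toy sketch in Python with SQLite (illustrative schema and data, not Hugo's tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (name TEXT PRIMARY KEY, lft INT, rgt INT)")
# god
#  +-- army1
#  |     +-- john
#  +-- army2
conn.executemany("INSERT INTO tree VALUES (?, ?, ?)", [
    ("god",   1, 8),
    ("army1", 2, 5),
    ("john",  3, 4),
    ("army2", 6, 7),
])
# ancestors of 'john': every node whose interval strictly encloses john's
ancestors = [r[0] for r in conn.execute("""
    SELECT a.name
    FROM tree a JOIN tree n ON a.lft < n.lft AND a.rgt > n.rgt
    WHERE n.name = 'john'
    ORDER BY a.lft
""")]
print(ancestors)  # ['god', 'army1']
```

Note that nested sets encode a strict tree, so they would not directly handle the challenge's multiple-parent hierarchies, but for plain trees the ancestor query needs no iteration at all.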

Nick Landsberg

May 23, 2004, 10:33:10 PM

Thanks Hugo,

The reason why I posed the query is not so much to
make work for you, but to get a flavor of how the
two technologies compare /as they are now/.

If Neo's techniques prove out, they may
be interesting for commercial work in several
years. One never knows. As I alluded to upthread,
these discussions remind me a lot of the
CODASYL vs. RDBMS discussions of roughly 20-25
years ago. Assuming that Neo's ideas pan out,
I would not presume that they would supplant
traditional RDBMS's, just as RDBMS's have
not completely supplanted hierarchical and
network models. There is more than one way
to skin a cat, and you use the best tools
available depending on the size of the cat :)

You and Neo have had an interesting discussion.
And, what is rare on usenet, haven't let the
discussion degenerate into a flame war.
I congratulate you both!

Neo

May 23, 2004, 11:33:15 PM
> [Time elapsed for a hierarchy of 8 things]:
> RM's simple: 65 ms (typical, but as low as 30)
> XDb1's simpler: 2.23 ms (v4.5.0)
> *XDb1's generic: 3.65 ms (v4.5.0)
>
> [Time elapsed for a hierarchy of 200 things]:
> RM's simple: 11 sec (lowest of 11,16,15,15,14,11,13,14,14,13)
> XDb1's generic: 5.438 sec (v4.5.0)

A correction on above line starting with '*'.

Hugo Kornelis

May 24, 2004, 6:10:28 PM
On 23 May 2004 13:49:30 -0700, Neo wrote:

Hi Neo,

(snip)

>The provided RM db's schema is not as generic as XDb1's in the
>following respects:
>1. It does not model a thing's name as an attribute.

It does, and then goes on to make that attribute the primary key.

>2. It does not model attributes with multiple values in a normalized
>manner.

Your original post did not contain that requirement. I already showed
several way to enhance my model to allow this, all (unlike XDb1) without
the need to contact the vendor.

>3. It requires a separate table for values of each each data type!!!

If that bothers you, replace them with one table and use the (SQL Server
proprietary) datatype "sql_variant". I chose to use two tables, because I
like to be sure when I can or can't subtract values from a columns and
because I prefer portable syntax over proprietary syntax when feasible.
Using two tables doesn't make my model less generic.

>4. It does not model the class hierarchy in the appropriate table thus
>making it inaccessible for reporting.

Your original post didn't point out that classes may form a hierarchy as
well. I already supplied an extra table to model that hierarchy, thereby
making it available for reporting.

>5. It does not model a way to normalize names of things. (ie
>person/john and dog/john)

First, I don't know why you use the term normalization here.

Second, it's XDb1 that can't display the names this way. I know that you
have plans to adapt XDb1, to include this functionality. But as it stands,
this output is not an option for the XDb1 model.

Third, my implementation does model a way to show the class of a thing in
front of it's name. Simply join in the classes table and use SQL Server's
'+' operator to concatenate the strings together. If the report you wanted
me to produce for the challenge had already contained the class, I would
have made my query this way right from the start.

>6. It does not model a way to normalize parts of a name (ie 'smith' in
>'john smith' and 'mary smith')
>7. It does not model a way to normalize symbols in names. (ie 'o' in
>'john' and 'god')

I'm aware that XDb1's philosophy as to what "normalizing" means is quite
different from the relational model's philosophy. But since you asked, in
your original message, for an implementation "using the relational model",
there's no point complaining that I normalized data according to the
relational normalization rules. It's what you asked for, isn't it?


>Also the provided RM db's data is not as normalized as XDb1's. RM's
>has duplicate data which needs to be replaced with proper refs (keys)
>to the one and original data within the db. A proper ref must be
>independent of data being represented, otherwise problems arise.

The relational model allows for the natural key to be used as a table's
primary key and for foreign keys (references) to use that (natural)
primary key. Therefor, my choice to use the name as the primary key in the
things table and as foreign key in the other tables is proper within the
context of the relational model.

The debate as to whether using a natural key is a Good Thing or slapping an
autoincrementing integer key onto every table is a Better Thing may still
cause Holy Wars in some newsgroups. I won't tread there now, though my
choice of words (and my provided model) might be taken as a hint to my
preference. But I'm not anal about it and I will (and do) use artificial
keys when I think that there's a good reason to do so.


> For
>example, in RM's, two things can't have the same name and can't
>properly represent an unnamed thing.

That was not a requirement for the original challenge. Also, I have
already hinted at possible ways to solve this issue.
I'll post another message later today with more on this issue.


>Some differences between XDb1's and RM's reporting algorithms:
>1. RM's report is more complete. It reports all of the nearest common
>ancestors, where as XDb1's only reports any one the nearest common
>ancestors.
>
>2. RM's report algorithm doesn't access the attributes of any thing,
>where as XDb1's can. For instance, if a thing is unnamed, XDb1 prints
>the thing's classes and attributes!!!

Good point. I have to concede that it would take rather nasty SQL to mimic
this behaviour. I'm glad you didn't include this function of XDb1 in the
challenge, otherwise I'd have had a hard time matching your performance on
the sample data!


>3. XDb1's algorithm is extremely memory efficient. The major variable
>in calculating memory requirements are 3 integer arrays that must be
>larger than the depth of the hierarchy, regardless of the number of
>things in the hierarchy. Thus for a hierarchy 100 deep, XDb would need
>300 integers for these arrays. RM's creates an intermediate table
>whose size can be much much larger since there can be nearly an
>infinite number of things in a 100-depth hierarchy!

The tests I ran with large amounts of data show that my procedure's
running time rises exponentially when more data is added. My model is
definitely not a good implementation for generating this report out of any
serious amount of data.

The description of your algorithm is intriguing to say the least. Even
while knowing that there apparently is a way to generate this report with
no more intermediate storage than 300 integers for a 100-depth hierarchy,
I still can't figure out how you do it! Care to elaborate?
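I can only guess at XDb1's internals, but there is a classic family of algorithms with that memory profile: for a single-parent hierarchy, the nearest common ancestor of two things can be found using only per-path storage bounded by the hierarchy's depth. A sketch of that flavor of algorithm (NOT XDb1's actual code; the data is made up):

```python
# single-parent hierarchy: child -> parent (None marks the root)
parent = {"john": "army1", "army1": "god", "mary": "army2", "army2": "god",
          "god": None}

def path_to_root(node):
    path = []                      # at most one entry per level of depth
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def nearest_common_ancestor(x, y):
    ancestors_of_x = set(path_to_root(x))
    for candidate in path_to_root(y):   # first hit is the nearest one
        if candidate in ancestors_of_x:
            return candidate
    return None

print(nearest_common_ancestor("john", "mary"))  # god
```

Handling multiple parents per thing (as the challenge requires) would need a depth-first walk over every root-to-thing path, but the per-path storage stays proportional to the depth, not to the number of things.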


>With a very small dataset, on a off-line, 500 Mhz, Dell PowerEdge, 512
>MB, Ultra SCSC II, 10,000 RPM HDs, NT 4 sp6a, SQL Server 7: RM's
>solution takes 65 ms on avg, but as low as 30ms and as high as 1186
>ms.

You never replied to my message where I gave some possible reasons why
your server gave much slower response than mine. Did you check the
collation? This will definitely have a SEVERE impact on the execution time
of my original entry for the challenge.

> XDb1's simplified solution (only printing thing's ID)

is mildly interesting, but of no relevance for the challenge. Plus, it
doesn't produce human-readable output, so it's of no relevance to anybody
else either.

> which very
>roughly approximates RM's solution (prints char-based keys which
>happen to encode thing's name also),

Not "happen to" - I deliberately chose to use a natural key, even though I
am aware that this would hamper the performance.
Please check a soon-to-be-posted other message to find out why I have made
an alternate implementation (using artificial keys, like you so much seem
to desire) and how that actually increased my performance!

> takes as high as 2.23 ms. XDb1's
>generic solution (which is normalized down to atomic symbols and
>accesses the same symbol 'o' in the db when printing 'john' and 'god'
>on the report and will print an unnamed thing's classes and
>properties), takes 2 or 3 ms.
>
>Summary of processing time for a small data set (8 things):
>RM's simple solution: 65 ms (typical)
>XDb1's simpler solution: 2.23 ms (max)
>XDb1's generic solution: 2 or 3 ms (typical)
>
>On a larger hierarchy consisting of 200 goats (5 generations x 40
>goats/generation, each goat having two parents):
>RM's simple solution: 11 sec (lowest of 11,16,15,15,14,11,13,14,14,13)
>XDb1's generic solution: 5.438 sec (only 1 run thus far with beta
>v4.5.0)
>
>Please note: the above are apple-to-orange comparisions.

Yes - especially if you have SQL Server set to do case insensitive string
comparisons, which incur quite some overhead. I just checked what happened
if I enter "hugo isa person." / "Hugo is 35." (note the capital H in the
second Hugo!) in XDb1; it asks me "What is 'Hugo'?", so I can conclude that
XDb1 uses case sensitive string comparisons. There's no reason to deny SQL
Server that privilege.

Hugo Kornelis

May 24, 2004, 6:48:07 PM
Hi Neo,

I promised a second post - here it is.

When I was testing my algorithm against larger amounts of data, I noticed
that it was bugged. The algorithm to determine all ancestors of each thing
may sometimes try to insert the same thing/ancestor combination with two
different path lengths. I had to use a GROUP BY and a MIN(..) to solve
this.
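The bug is easy to reproduce outside SQL: when a thing reaches the same ancestor along two paths of different lengths, the closure holds two candidate distances and the shorter one must win, which is what the GROUP BY plus MIN(..) accomplishes. A Python toy of the same fix (hypothetical data, not the challenge schema):

```python
# edges: thing -> direct leaders; note that 'c' reaches 'root' along two
# paths of different lengths, which is exactly what triggered the bug
parents = {"c": ["a", "b"], "a": ["root"], "b": ["x"], "x": ["root"]}

def ancestor_distances(parents):
    # dist[(thing, ancestor)] = length of the SHORTEST path between them
    dist = {}
    for thing, leaders in parents.items():
        frontier = [(leader, 1) for leader in leaders]
        while frontier:
            node, d = frontier.pop()
            # keep the minimum, like GROUP BY thing, ancestor + MIN(dist)
            if d < dist.get((thing, node), float("inf")):
                dist[(thing, node)] = d
                for up in parents.get(node, []):
                    frontier.append((up, d + 1))
    return dist

dist = ancestor_distances(parents)
print(dist[("c", "root")])  # 2 -- via 'a', not 3 via 'b' and 'x'
```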

I didn't like the performance hit I took (up to 14.3 ms avg) - still
better than the 16 ms that XDb1 needs on this same machine, but I figured I
could do better.

I changed my model to use integer columns with the IDENTITY property (SQL
Server's name for autoincrementing artificial ID) as primary keys. This
improved the performance of my report generation; the average execution
time is now down to 11.0 ms. Note that my report still shows the names of
the things in the hierarchies, not the ID's. (SQL and output are down
below).

Incidentally, this quicker implementation is also better equipped to adapt
to many of the changes you suggested after your original post.
You want different things to share the same name? Easy - just remove the
unique constraint on thingName and you're done.
You want things without name? Remove thingName from the things table,
create a new table thingNames with columns thing (primary key / foreign
key to things) and thingName.
You want things with multiple names? As above, but primary key on
thingNames table should span both columns.
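The last variant can be sketched with SQLite from Python (illustrative DDL, not the exact SQL Server version): a thing-to-name table whose primary key spans both columns allows zero, one, or many names per thing, and lets two things share a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE things (thing INTEGER PRIMARY KEY);
    -- key on (thing, thingName): many names per thing, shared names allowed
    CREATE TABLE thingNames (
        thing     INTEGER NOT NULL REFERENCES things,
        thingName TEXT    NOT NULL,
        PRIMARY KEY (thing, thingName)
    );
    INSERT INTO things VALUES (1), (2), (3);          -- thing 3 stays unnamed
    INSERT INTO thingNames VALUES (1, 'john'), (1, 'johnny'), (2, 'john');
""")
names_of_1 = [r[0] for r in conn.execute(
    "SELECT thingName FROM thingNames WHERE thing = 1 ORDER BY thingName")]
things_named_john = [r[0] for r in conn.execute(
    "SELECT thing FROM thingNames WHERE thingName = 'john' ORDER BY thing")]
print(names_of_1)          # ['john', 'johnny']
print(things_named_john)   # [1, 2]
```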

All this doesn't matter for the challenge you set, as these were all
requirements that you brought into the equation at a later time. What does
matter is my improved model, that you'll find below. The average execution
time is 11.0 ms (as you can see from the output). This is based on an
execution on my desktop system (1.3 GHz Athlon, 256 MB RAM, 2 harddisks at
40 GB / 7200 rpm each) with several open applications. XDb1 takes 16 ms
for generating the report on the same machine with the same applications
still active.


(Start of SQL for improved model)

-- define some tables to hold the user input
create table things (thing int not null identity,
                     thingName varchar(20) not null unique,
                     primary key (thing))
create table classes (class int not null identity,
                      className varchar(20) not null unique,
                      primary key (class))
create table HierarchyTypes (hierarchy int not null identity,
                             hierarchyName varchar(20) not null unique,
                             primary key (hierarchy))
create table ClassOfThings (thing int not null references things
                                on update cascade on delete cascade,
                            class int not null references classes
                                on update cascade on delete cascade,
                            primary key (thing))
create table ClassHierarchy (SubClass int not null references classes
                                 on update cascade on delete cascade,
                             SuperClass int not null,
                                 -- references classes on update cascade
                                 -- on delete cascade
                             primary key (SubClass, SuperClass))
-- update and delete triggers needed to cascade updates and deletes
-- of classes to SuperClass; SQL Server doesn't accept this second
-- DRI action, unfortunately.
create table attributes_int (thing int not null references things
                                 on update cascade on delete cascade,
                             attribute varchar(10) not null,
                             value int not null,
                             primary key (thing, attribute))
create table attributes_char (thing int not null references things
                                  on update cascade on delete cascade,
                              attribute varchar(10) not null,
                              value varchar(100) not null,
                              primary key (thing, attribute))
create table hierarchies (thing int not null references things
                              on update cascade on delete cascade,
                          otherthing int not null,
                              -- references things on update cascade
                              -- on delete cascade
                          hierarchy int not null references HierarchyTypes
                              on update cascade on delete cascade,
                          primary key (hierarchy, thing, otherthing),
                          check (thing <> otherthing))
-- update and delete triggers needed to cascade updates and deletes
-- of things to otherthing; SQL Server doesn't accept this second
-- DRI action, unfortunately.
-- this table is used to generate the report
create table ancestors (thing int not null,
                        ancestor int not null,
                        dist int not null,
                        primary key (thing, ancestor))
create unique nonclustered index testindex on ancestors(ancestor, thing)
-- this table will hold the report - here, the things' name is used
create table NCancestor (ThingX varchar(20) not null,
                         ThingY varchar(20) not null,
                         CmnAnc varchar(20) not null,
                         Dist int not null,
                         primary key (ThingX, ThingY, CmnAnc))
go
go
-- this procedure creates the report
create proc CommonAncestors
(@hierarchyName varchar(20))
as
declare @hierarchy int
select @hierarchy = hierarchy
from HierarchyTypes
where hierarchyName = @hierarchyName
-- delete data from previous execution
truncate table ancestors
truncate table NCancestor
-- starting data: thing itself is dist 0, direct leader is dist 1
insert ancestors with (tablockx holdlock) (thing, ancestor, dist)
select distinct thing, thing, 0
from things with (tablockx holdlock)
union all
select thing, otherthing, 1
from hierarchies with (tablockx holdlock)
where hierarchy = @hierarchy
-- use iteration to determine indirect leaders
while (@@rowcount<>0)
insert ancestors with (tablockx holdlock) (thing, ancestor, dist)
select a1.thing, a2.ancestor, min(a1.dist + a2.dist)
from ancestors AS a1
join ancestors AS a2
on a2.thing = a1.ancestor
where not exists
(select *
from ancestors AS a3
where a3.thing = a1.thing
and a3.ancestor = a2.ancestor)
group by a1.thing, a2.ancestor
-- now find nearest common ancestor for each pair of things
insert NCancestor with (tablockx holdlock) (ThingX, ThingY, CmnAnc, Dist)
select t1.thingName, t2.thingName, ta.thingName, a1.dist + a2.dist
from ancestors AS a1 with (tablockx holdlock)
join ancestors AS a2
on a1.ancestor = a2.ancestor
and a1.thing < a2.thing
join things AS t1
on t1.thing = a1.thing
join things AS t2
on t2.thing = a2.thing
join things AS ta
on ta.thing = a1.ancestor
where not exists
(select *
from ancestors AS a3
join ancestors AS a4
on a3.ancestor = a4.ancestor
and a3.thing < a4.thing
where a3.thing = a1.thing
and a4.thing = a2.thing
and a4.dist+a3.dist < a1.dist+a2.dist)
go
-- dummy execution on empty table forces compilation of stored proc
-- (I assume that Xdb1 uses a precompiled procedure as well)
exec CommonAncestors @hierarchyName = 'whatever'
go
--
-- now add the predefined data
insert things (thingName)
values ('god')
insert things (thingName)
values ('army')
insert things (thingName)
values ('trinity')
insert things (thingName)
values ('john')
insert things (thingName)
values ('mary')
insert things (thingName)
values ('luke')
insert things (thingName)
values ('fido')
insert things (thingName)
values ('laptop1')
go
insert classes (className)
values ('force')
insert classes (className)
values ('church')
insert classes (className)
values ('person')
insert classes (className)
values ('dog')
insert classes (className)
values ('computer')
go
insert HierarchyTypes (hierarchyName)
values ('leader')
go
insert ClassOfThings (thing, class)
values (2, 1)
insert ClassOfThings (thing, class)
values (3, 2)
insert ClassOfThings (thing, class)
values (4, 3)
insert ClassOfThings (thing, class)
values (5, 3)
insert ClassOfThings (thing, class)
values (6, 3)
insert ClassOfThings (thing, class)
values (7, 4)
insert ClassOfThings (thing, class)
values (8, 5)
go
insert attributes_int (thing, attribute, value)
values (4, 'age', 35)
insert attributes_int (thing, attribute, value)
values (5, 'weight', 130)
insert attributes_char (thing, attribute, value)
values (6, 'color', 'red')
go
insert hierarchies (thing, otherthing, hierarchy)
values (2, 1, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (3, 1, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (4, 2, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (5, 2, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (5, 3, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (6, 3, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (8, 4, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (8, 5, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (7, 4, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (7, 5, 1)
insert hierarchies (thing, otherthing, hierarchy)
values (7, 6, 1)
go
--
-- random name / hierarchy generation for lots of data goes here
--
print 'Initialization done - starting actual test'
go
--
-- generate the report - show elapsed time
declare @started datetime, @ended datetime
set @started = current_timestamp
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
exec CommonAncestors @hierarchyName = 'leader'
set @ended = current_timestamp
select convert(char(24), @started, 121) as started,
convert(char(24), @ended, 121) as ended,
convert(char(14), (@ended - @started), 114) as "elapsed (10 executions)"
-- display the results
select *
from NCancestor
order by ThingX, ThingY
go
--
-- cleanup after execution
drop proc CommonAncestors
drop table NCancestor
drop table ancestors
drop table hierarchies
drop table attributes_int
drop table attributes_char
drop table ClassHierarchy
drop table ClassOfThings
drop table HierarchyTypes
drop table classes
drop table things
go

-------------------------------------------------------------------------

(Output after running the above script)

Initialization done - starting actual test
started                  ended                    elapsed (10 executions)
------------------------ ------------------------ -----------------------
2004-05-25 00:37:03.780  2004-05-25 00:37:03.890  00:00:00:110

ThingX               ThingY               CmnAnc               Dist
-------------------- -------------------- -------------------- -----------
army                 fido                 army                           2
army                 john                 army                           1
army                 laptop1              army                           2
army                 luke                 god                            3
army                 mary                 army                           1
army                 trinity              god                            2
fido                 laptop1              john                           2
fido                 laptop1              mary                           2
god                  army                 god                            1
god                  fido                 god                            3
god                  john                 god                            2
god                  laptop1              god                            3
god                  luke                 god                            2
god                  mary                 god                            2
god                  trinity              god                            1
john                 fido                 john                           1
john                 laptop1              john                           1
john                 luke                 god                            4
john                 mary                 army                           2
luke                 fido                 luke                           1
luke                 laptop1              trinity                        3
mary                 fido                 mary                           1
mary                 laptop1              mary                           1
mary                 luke                 trinity                        2
trinity              fido                 trinity                        2
trinity              john                 god                            3
trinity              laptop1              trinity                        2
trinity              luke                 trinity                        1
trinity              mary                 trinity                        1

-------------------------------------------------------------------------
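For readers who want the gist of the stored procedure without running the T-SQL, here is a short Python sketch (my own, not part of the posted solution) of the same two-phase algorithm: iteratively close the ancestors table with shortest distances, then, for each pair, keep every common ancestor that minimizes the combined distance.

```python
def build_ancestors(things, edges):
    """Return {(thing, ancestor): dist}; edges are (thing, parent) pairs."""
    anc = {(t, t): 0 for t in things}          # dist 0: the thing itself
    for child, parent in edges:                # dist 1: direct leader
        anc[(child, parent)] = 1
    while True:                                # mirrors the proc's WHILE loop
        new = {}
        for (t1, a1), d1 in anc.items():       # join ancestors with itself
            for (t2, a2), d2 in anc.items():
                if t2 == a1 and (t1, a2) not in anc:
                    new[(t1, a2)] = min(new.get((t1, a2), d1 + d2), d1 + d2)
        if not new:                            # @@rowcount = 0: closure done
            return anc
        anc.update(new)

def nearest_common_ancestors(things, edges):
    """Return rows (x, y, ancestor, dist) like NCancestor; ties are kept."""
    anc = build_ancestors(things, edges)
    dist = {}                                  # thing -> {ancestor: dist}
    for (t, a), d in anc.items():
        dist.setdefault(t, {})[a] = d
    rows = []
    for x in things:
        for y in things:
            if x >= y:                         # each unordered pair once
                continue
            common = {a: dist[x][a] + dist[y][a]
                      for a in dist[x] if a in dist[y]}
            if common:
                best = min(common.values())    # nearest = minimal total dist
                rows.extend((x, y, a, d)
                            for a, d in common.items() if d == best)
    return rows
```

With the sample data above (god=1 through laptop1=8 and the eleven leader edges), this reproduces the 29-row report, including the fido/laptop1 tie on both john and mary.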

Hugo Kornelis
May 24, 2004, 7:25:49 PM

On Mon, 24 May 2004 02:33:10 GMT, Nick Landsberg wrote:

Hi Nick,

(snip)


>
>Thanks Hugo,
>
>The reason why I posed the query is not so much to
>make work for you, but to get a flavor of how the
>two technolgies compare /as they are now/.

Your remark made me curious to find out how my model would hold up against
more data. I created a script to generate lots of random data, ensuring
that the hierarchy would remain connected, strictly top-down yet
distributed as randomly as possible. I think I succeeded (generation script
available, in case anyone is interested).

The setup I used for these tests is:
* SQL Server 2000 SP3a, running as a service on my desktop computer, with
only one active application (Query Analyzer).
* 1.3 GHz Athlon processor / 256 MB RAM.
* Two 40GB 7200 rpm IDE Harddisks. The LOG file for my test DB is on one
of them. The DATA file for my test DB, plus the DATA and LOG file for
tempdb (used by SQL Server to store intermediate result sets) is on the
other harddisk. My Windows page file is on that hard disk as well (though
on another partition).
* I made sure that both the test database (DATA and LOG!!) and tempdb were
large enough to hold all intermediate data and the result. SQL Server will
automatically allocate extra disk space if a database needs more room, but
repeatedly allocating extra disk space is a performance killer (for the
first execution only!!), so I decided to preallocate the space.
* For all tests, I first recreated the tables and procedure, filled the
tables with the original data and then added extra rows in the things and
hierarchies tables.
* The results below show: number of rows in the things and hierarchies
table (indicating input volume), number of rows in ancestors and
NCancestor tables (indicating size of intermediate and output table) and
time elapsed for one execution of the stored procedure.
* I did some minor tweaking to the procedure after carrying out these
tests but before posting Neo the code (see other message). I think that
the improvement is only marginal - at least not enough to warrant
repeating all tests again.


Here are the results I gathered:

things      hierarchies ancestors   NCancestor  elapsed
----------- ----------- ----------- ----------- --------------
         18          31          81         153 00:00:00:017
        108         210        1307        5778 00:00:00:513
        208         505        4566       21528 00:00:02:407
        408        1101       16938       83028 00:00:14:377
        608        1700       36619      184528 00:00:51:877
       1008        2900       94318      507528 00:05:03:577


And two larger tests in a very atypical setup (running SQL Server in
single-user mode - normally only used for maintenance, but apparently
also quicker)

things      hierarchies ancestors   NCancestor  elapsed
----------- ----------- ----------- ----------- --------------
       1000        2883       94255      499500 00:04:00:250
       2500        7381      464258     3123750 01:02:13:580

As you can see, elapsed time grows exponentially as the size of the input
increases.

Can the above performance be improved upon? Yes, of course. If I had
unlimited money to spare, I'd start with the following setup:

* Have SQL Server run on a dedicated server. Enable only the services that
SQL Server needs, shut down everything else.
* Fastest processor currently available. Though SQL Server can use
multiple processors, I don't think they would help for this particular
query (but if money is not an issue, I'd still try, just to be sure).
* As much RAM as the server will hold. Over 2GB is only used by SQL Server
Enterprise Edition, so get a license for that version as well.
* At least five separate I/O devices: two stripe sets (for test DB Data
and tempdb Data) (the more physical disks in the stripe set, the better),
and three non-striped disks (one for test DB Log, one for tempdb Log and
one for operating system, including pagefile).
* I'd hire a tuning expert. Though I do know a few tricks, I'm definitely
not an expert on fine-tuning SQL Server.
* And I'd buy Joe Celko's latest book on trees and hierarchies in SQL. His
alternative approach I hinted at in an earlier message will not do
unfortunately - I expect it to be blindingly fast, but it can only handle
hierarchies where each thing has at most one parent. But maybe he's got
some other neat tricks in his book that I might want to steal.

Another thing I'd consider is moving some work from the stored procedure
to an insert trigger. I didn't do this for the challenge, of course, but
if this were to become a real production system, I'd definitely try to
find out if insert/update/delete performance for typical daily data entry
tasks would be acceptable if I move some preparations for this report to
the trigger.


But I'm also sure that there's no way that I would ever get this procedure
to perform adequately (unless I'd find a completely different approach in
Celko's book). All performance gains that better hardware can buy are
linear; the performance hit for extra input data is exponential. That
means that there's no way to keep the performance acceptable as the data
continues to grow.

Is this a surprise to me? No, it isn't. Frankly, I've never considered a
relational database as a good tool for storing and traversing trees. There
are many other tools out there more suited for that task.

I never entered this discussion to prove that the relational model is well
suited for tree-structures, because frankly, I think it isn't. I only
entered this discussion because Neo set a challenge and I decided to take
it on - and succeeded.


>
>If Neo's techniques prove out, they may
>be interesting for commercial work in several
>years. One never knows. As I alluded to upthread,
>these discussion remind me a lot about the
>CODASYL vs. RDBMS discussions of roughly 20-25
>years ago. Assuming that Neo's ideas pan out,
>I would not presume that they would supplant
>traditional RDBMS's, just as RDBMS's have
>not completely supplanted hierarchical and
>network models. There is more than one way
>to skin a cat, and you use the best tools
>available depending on the size of the cat :)

Agreed. There is no "one size fits all" in our industry.


>
>You and Neo have had an interesting discussion.
>And, what is rare on usenet, haven't let the
>discussion degenerate into a flame war.
>I congratulate you both!

Thanks, Nick!

Neo
May 24, 2004, 9:10:51 PM

> How does this critter perform on a simple query which extracts just
> the record for "John Smith"...

Comparisons haven't been made yet, but it should fare well in some
cases. For example, the name "john smith" in essence has a reference to
all things with that name, even if it is a dog. In this case, resolving
to the named things doesn't require traversing persons or dogs.

Below is an example of finding things based on intersections expressed
via simple "relational expressions". Suppose user had entered the
following data:

color isa thing.
black isa color.
white isa color.

person isa thing.
john smith isa person.
john smith is black.

mary isa person.
mary is white.

bob isa person.
bob is black.
bob is white. (he is both black and white)

dog isa thing.
fido isa dog.
fido is white.

Following query finds all persons:
(person)

Following query finds a person named mary
(person) mary

Following query finds a person named john smith:
(person) "john smith"

Following query finds all persons who are black (john & bob):
(person) (black)

Following query finds all persons who are black and white (only bob):
(person) (black) (white)

Following query finds all things that are white (mary & bob & fido!):
(white)


Other examples are posted in recent posts in comp.database.object and
at www.xdb1.com/Example/Ex005.asp
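A rough way to picture these intersection queries (my own sketch, not XDb1 code; the store and syntax handling are invented for illustration): treat each classification as a set of things and intersect the sets named by each term.

```python
# Facts from the example above: each class/attribute maps to the set of
# things so classified.
facts = {
    "person": {"john smith", "mary", "bob"},
    "dog":    {"fido"},
    "black":  {"john smith", "bob"},
    "white":  {"mary", "bob", "fido"},
}
everything = set().union(*facts.values())

def query(*terms):
    """Terms like "(person)" intersect a class; a bare term matches a name."""
    result = set(everything)
    for term in terms:
        if term.startswith("(") and term.endswith(")"):
            result &= facts.get(term[1:-1], set())  # class membership
        else:
            result &= {term}                        # literal name match
    return result
```

For instance, `query("(person)", "(black)", "(white)")` yields only bob, while `query("(white)")` yields mary, bob, and fido, matching the examples above.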

Nick Landsberg
May 24, 2004, 10:43:53 PM

Hugo Kornelis wrote:

Thanks for the additional info, Hugo.

Some comments and some snippage below :)

> On Mon, 24 May 2004 02:33:10 GMT, Nick Landsberg wrote:
>
> Hi Nick,
>
> (snip)
>
>>Thanks Hugo,
>>
>>The reason why I posed the query is not so much to
>>make work for you, but to get a flavor of how the
>>two technolgies compare /as they are now/.
>
>
> Your remark made me curious to find out how my model would hold against
> more data. I created a script to generate lots of random data, ensuring
> that the hierarchy would remain connected, strictly top-down yet
> distributed as random as possible. I think I succeeded (generation script
> available, in case anyone is interested).
>
> The setup I used for these tests is:
> * SQL Server 2000 SP3a, running as a service on my desktop computer, with
> only one active application (Query Analyzer).
> * 1.3 GHz Athlon processor / 256 MB RAM.

This would be underpowered for a commercial setup
from a CPU standpoint and from a memory standpoint.
As a reference point, what I use at work is a
12 CPU Sparc @ 1.28 GHz with 64 GB of memory.
(But the 12 CPU's don't help much with query analysis
since that has to be single threaded. They DO help
on other respects). No, we don't run SQL-server.

> * Two 40GB 7200 rpm IDE Harddisks. The LOG file for my test DB is on one
> of them. The DATA file for my test DB, plus the DATA and LOG file for
> tempdb (used by SQL Server to store intermediate result sets) is on the
> other harddisk. My Windows page file is on that hard disk as well (though
> on another partition).

Good partitioning. But with only 256 MB RAM you were
beating the living spit out of the disks, which may
explain some of the exponential behaviour you noted
below. Those disks have an average latency of
about 6-8 ms. per random I/O. I have no idea
whether or not SQL-server would write to both
data and log files for tempdb at the same time,
but this may have increased the latency with many
(almost) random seeks.

> * I made sure that both the test database (DATA and LOG!!) and tempdb were
> large enough to hold all intermittant data and the result. SQL Server will
> automatically allocate extra disk space if a database needs more room, but
> repeatedly allocating extra disk space is a performance killer (for the
> first execution only!!), so I decided to preallocate the space.

OK

> * For all tests, I first recreated the tables and procedure, filled the
> tables with the original data and then added extra rows in the things and
> hierarchies tables.

This eliminates the effects of caching, but is also
a worst case. From a performance paranoid's perspective
this is "a good thing." :)

Yes, it does. I am still wondering tho, about how much of that
elapsed time was disk I/O latency vs. CPU time. It depends on the
size of the in-memory cache (which I said above was small for
commercial implementations) and the effectiveness of the caching
algorithm, usually LRU.

>
> Can the above performance be improved upon? Yes, of course. If I had
> unlimited money to spare, I'd start with the following setup:

Wouldn't we *ALL* love to have unlimited money. :)

>
> * Have SQL Server run on a dedicated server. Enable only the services that
> SQL Server needs, shut down everything else.

Typical commercial installations of any DBMS do this.

> * Fastest processor currently available. Though SQL Server can use
> multiple processors, I don't think they would help for this particular
> query (but if money is not an issue, I'd still try, just to be sure).

That depends on the implementation, but I believe you're right
that it probably would not help for this particular query.

> * As much RAM as the server will hold. Over 2GB is only used by SQL Server
> Enterprise Edition, so get a license for that version as well.

See comment about 64 GB RAM above. I'm spoiled.

> * At least five seperate I/O devices: two stripe sets (for test DB Data
> and tempdb Data) (the more physical disks in the stripe set, the better),
> and three non-striped disks (one for test DB Log, one for tempdb Log and
> one for operating system, including pagefile).

Absolutely! A RAID array with non-volatile cache might
help, assuming that you're I/O limited rather than CPU
limited. If you're CPU limited this might not help.

> * I'd hire a tuning expert. Though I do know a few tricks, I'm definitely
> not an expert on fine-tuning SQL Server.

My experience has been that fine tuning gets you maybe
20-25% improvement at best. (Unless you've really
screwed up in the beginning.) Your exponential response
with size of data will stay exponential. Sorry.

> * And I'd buy Joe Celko's latest book on trees and hierarchies in SQL. His
> alternative approach I hinted at in an earlier message will not do
> unfortunately - I expect it to be blindingly fast, but it can only handle
> hierarchies where each thing has at most one parent. But maybe he's got
> some other neat tricks in his book that I might want to steal.

In this business, stealing from a respected author
is considered an honorable thing to do. :)

>
> Another thing I'd consider is moving some work from the stored procedure
> to an insert trigger. I didn't do this for the challenge, of course, but
> if this were to become a real production system, I'd definitely try to
> find out if insert/update/delete performance for typical daily data entry
> tasks would be acceptable if I move some preparations for this report to
> the trigger.

This is what I'm most concerned about the difference in the two
approaches. From experience (and many, many tests), I can
predict what the performance of single-record inserts, updates,
deletes, etc. will take for any given schema within tolerable
limits given an RDBMS. Generally speaking, fine-tuning report
generation forces you to pay a penalty when doing individual
transactions and vice-versa. (But we all know that, don't we? :)


>
>
> But I'm also sure that there's no way that I would ever get this procedure
> to perform adequately (unless I'd find a completely different approach in
> Celko's book). All performance gains that better hardware can buy are
> linear; the performance hit for extra input data is exponential. That
> means that there's no way to keep the performance aceptable as the data
> continues to grow.

I wouldn't be too sure that the gains from better hardware
are just linear. There are multiple factors involved here.
If you are limited by memory (as you are), then, by
necessity, you increase the I/O rates to disk, which limit you
even further. You don't have gobs of money to spend on
RAID arrays and memory in your position. If possible,
a breakdown of those elapsed time numbers as to physical
I/O vs. CPU time would let someone extrapolate.

>
> Is this a surprise to me? No, it isn't. Frankly, I've never considered a
> relational database as a good tool for storing and traversing trees. There
> are many other tools out there more suited for that task.
>
> I never entered this discussion to prove that the relational model is well
> suited for tree-structures, because frankly, I think it isn't.I only
> entered this discussion because Neo set a challenge and I decided to take
> it on - and succeeded.

[SNIP]

>
>
> Thanks, Nick!
>
> Best, Hugo

And thanks for your replies!

Nick Landsberg
May 24, 2004, 10:56:45 PM

Neo wrote:

Apologies to both you and Hugo for being
dense, but I have a few questions regarding
the following:

Aha ... This would extract the "record" which
identifies "John Smith"? Including, name, rank,
social security number, shoe size, and anything else
stored along with the identifier? (Again, pardon,
I am struggling to understand).

>
> Following query finds all persons who are black (john & bob):
> (person) (black)
>
> Following query finds all persons who are black and white (only bob):
> (person) (black) (white)
>
> Following query finds all things that are white (mary & bob & fido!):
> (white)
>
>
> Other examples are posted in recent posts in comp.database.object and
> at www.xdb1.com/Example/Ex005.asp

Sorry, I run a Mac at home and can't use the .exe file.
The company frowns upon me loading stuff onto the
corporate PC. Similarly, even if I could, it wouldn't
run on the corporate "mainframe" (which is a Sun).
See remarks I made to Hugo about the size of machine.

Neo
May 25, 2004, 2:37:04 PM

> > > How ... extracts just the record for "John Smith"...

> >
> > (person) "john smith"
>
> Aha ... This would extract the "record" which identifies "John Smith"?
> Including, name, rank, social security number, shoe size,
> and anything else stored along with the identifier?

Yes, it finds the thing which represents John Smith. From that thing,
remaining data can be retrieved. Below is simplified pseudocode to
find him and print his SSN:

pT = X("(person) 'john smith'");
pSSN = T_Val_get(pT, pSSN_Cls);
str = T_Name(pSSN);
print(str);

Some functions (ie X, T_Val_get) can return 0 to many things, thus the
following programming structure is typical:

while (pT = X("(person) 'john smith'")){
// Process current person
}
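
Neo's calling convention (X returning successive matches until there are none left) maps naturally onto an iterator. The following Python analog is purely illustrative; the in-memory store, keys, and SSN values are all invented:

```python
# Hypothetical in-memory store; each key pairs a class with a name.
db = {("person", "john smith"): {"ssn": "123-45-6789"},
      ("person", "mary"):       {"ssn": "987-65-4321"}}

def X(cls, name):
    """Yield every thing classified as cls with the given name (0 to many)."""
    for (c, n), attrs in db.items():
        if c == cls and n == name:
            yield attrs

for person in X("person", "john smith"):  # the typical 0-to-many loop
    print(person["ssn"])                  # prints 123-45-6789
```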

Hugo Kornelis
May 25, 2004, 5:48:42 PM

On Tue, 25 May 2004 02:43:53 GMT, Nick Landsberg wrote:

Hi Nick,

(snip)


>> The setup I used for these tests is:
>> * SQL Server 2000 SP3a, running as a service on my desktop computer, with
>> only one active application (Query Analyzer).
>> * 1.3 GHz Athlon processor / 256 MB RAM.
>
>This would be underpowered for a commercial setup
>from a CPU standpoint and from a memory standpoint.

Yes, indeed. But this computer is sitting right under my desk, with no
other users besides me. And my normal use of SQL Server does definitely
not involve navigating trees or producing what can best be described as
the cross join of a 10000 row table with itself, with extra information
based on rather complex logic.

(The 10000 row number I used here was my first "bigger" test, after some
very small tests to verify that my code worked. I tried to cancel that
query after three hours of execution time, causing my computer to crash;
after that, SQL Server needed an additional 2 hours of recovery time to
repair the damaged test database - that is when I decided to reduce the
size of my tests <g>)


(snip)


> I have no idea
>whether or not SQL-server would write to both
>data and log files for tempdb at the same time,
>but this may have increased the latency with many
>(almost) random seeks.

I don't know either. I did notice that both the data and log files for the
actual test database and the data file for tempdb grew considerably when I
ran my first tests (before I preallocated ample space); tempdb's log file
grew little or not at all.

I don't think moving tempdb's log to my other HD would have helped. With
only one active log file on that HD, there's never any need to reposition
the disk arm.


(snip)


> I am still wondering tho, about how much of that
>elapsed time was disk I/O latency vs. CPU time. It depends on the
>size of the in-memory cache (which I said above was small for
>commercial implementations) and the effectiveness of the caching
>algorithm, usually LRU.

So am I. I tried running my test script again (for 1000 things and 2883
hierarchies), but with the options SET STATISTICS TIME and SET STATISTICS
IO set to ON. (This time I didn't quit my other applications, instead I
happily continued reading & writing in Agent, testing some simple queries
in another database on the same server and checking some web pages).

The elapsed time reported by SET STATISTICS TIME is close to the elapsed
time I calculated by comparing start and end time, but there should have
been no difference at all (reported elapsed: 1106144 ms; my calculation
says 18:47:413 or 1127413 ms - a 21 sec difference!). The cpu time
reported by SET STATISTICS TIME is 939797 ms, so it looks as if there's
little time lost while waiting.

I suspect the output of SET STATISTICS IO to be incorrect. I won't
reproduce it here (it's quite long, listing stats for each query executed
in the stored proc and per table used). The part I don't believe is that
quite a lot of logical reads (from cache) are reported, but not a single
physical read (from disk). Yet, after execution of the query the amount of
space used in my test DB is almost 230 MB and tempdb (which I had shrunk
earlier today) has grown back to just over 2000 MB! With only 256 MB
present in my computer and many other applications running at the same
time, I can't believe that this was done without any physical I/O.

I would have liked to supply accurate measurements, but it seems I'll now
have to ask about this behaviour in one of the SQL Server newsgroups
instead...


(snip rest of message)

Neo
May 25, 2004, 8:04:06 PM

> When I wrote "untyped", I meant that XDb1 performs no type checking.

You are correct. Currently this burden falls upon the XDb1 user
(possibly later basic classes and associated operations will be
built-in).

> The fact that you can classify 35 as an integer doesn't solve anything.

It can. For example, if john's age 35 is classified as an integer and
mary's age 30 is classified as an integer, then the user can choose to
verify they are compatible and subtract them. If mary's age thirty is
classified as a word, then the user is burdened to convert it to an
integer, if possible. This is approximately what humans do when asked
what "35 - thirty" is.

> I want to be able to subtract John's age from Mary's age and be sure that the
> result is their age difference, in years. That can only be achieved if the
> database can be told that EACH age should be a whole number.

There is a general philosophical difference between RM and XDb1. RM
expects things to be of pre-defined classes (ie age+integer). On the
other hand, XDb1 accommodates things of different classes and even new
classes (ie age+integer, age+word, age+expression, etc). RM is strict
and consequently very reliable. XDb1 isn't strict and is consequently
less reliable (somewhat like human computers).

> Note that I expect that at least 99.9% of my future work will require the
> datatype of a value to be known to the programmer. This is based on 20
> years of experience (where the percentage has always been 100%); the 0.1%
> insecurity margin is included because I know that the unimaginable is
> usually just about to happen.

It is true that most PC programmers expect to know the data type in
advance (unless they're working on AI-type projects). There is a
computer that initially doesn't know classes such as integers,
decimals, fractions, and related operations. However, its OS allows it
to learn them. That computer is between our ears. Did god (assuming he
was the programmer) know what classes humans would contrive in advance
before he could program the DNA?

> Note also that MS SQL Server, the RDBMS I do most work on at the moment,
> supports a datatype "sql_variant", that will store basically everything. I
> have never used this datatype, not have I ever seen a question in any SQL
> Server related newsgroup (and I participate in quite a few of them) where
> this datatype would be the answer. And I expect I never will.

In the provided solution, there is a separate table for attribute
values of each data type, ie attributes_int, attributes_char, etc. It
might be possible to replace those with a single table named
attributes_variant.

Neo
May 25, 2004, 8:37:05 PM

> The choice to use one syntax for modifying both intension and extension,
> as opposed to the clear seperation of DML and DDL in SQL, gives me the
> impression that end users of XDb1 databases will be allowed to use this
> language to modify both the intension and the extension at will. While
> that may be acceptable in some situations, it usually isn't,

You are correct (XDb1 is partially aimed at AI-type applications where
such modifications may be more appropriate and where the programmer may
be a program).

> OTOH, you sometimes write that the app should check things, like in the
> above quote. That can only be done if the english-like language is not
> made available to end users. But in that case, why would you bother to use
> that language in the first place?

Just as a custom program can provide checks before entering info in a
SQL Server db, so can a custom app provide checks before entering info
into an XDb1 db. The standard XDb1 interface is extremely lenient. A
custom app can be much stricter. A custom app accesses the db via a DLL.

Neo
May 25, 2004, 9:22:10 PM

> XDb1's parser will gladly accept ages of 7, "over the top", "slightly
> older than I am" or even "I think therefore I am" without bothering to
> check if it has any overloaded operators that are able to do anything
> useful with that data.

It is true that at the db engine level, it will accept any of the
above. The burden falls upon the user to further classify 7 as an
integer, thirty as a word, and over-the-hill as an expression. The user
may take on the burden of creating an operation, valid on instances of
the word class, that attempts to convert them to a numerical value,
which can then be used to compare thirty to 7. In addition, each thing
can have properties (even if the thing itself is a property or value).
In the case of over-the-hill, it may have properties such as lowerLimit
and upperLimit which could help to make comparisons with 7.

Things in XDb1 can already inherit properties from their classes.
Although it hasn't been implemented, I can envision things inheriting
functions from their classes also. Thus, imagine the user creates a
function ConvertToNumber() and associates it with the word class. Now
the user creates an instance of word named "thirty". Then
thirty.ConvertToNumber() would return 30.
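A minimal sketch of the envisioned class-attached function, in Python
rather than XDb1; the Word class and the word-to-number lookup below are
assumptions for illustration only:

```python
# Sketch of the envisioned class-inherited operation: instances of a
# hypothetical Word class inherit a conversion function from their class.
WORD_VALUES = {"seven": 7, "thirty": 30}   # illustrative lookup only

class Word:
    def __init__(self, text):
        self.text = text

    def convert_to_number(self):
        # Fail loudly if no numeric reading is known for this word.
        return WORD_VALUES[self.text.lower()]

thirty = Word("thirty")
print(thirty.convert_to_number())        # -> 30
print(thirty.convert_to_number() > 7)    # -> True (thirty vs 7 comparison)
```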

Gene Wirchenko

unread,
May 25, 2004, 10:01:43 PM5/25/04
to
Hugo Kornelis <hugo@pe_NO_rFact.in_SPAM_fo> wrote:

[snip]

>The elapsed time reported by SET STATISTICS TIME is close to the elapsed
>time I calculated by comparing start and end time, but there should have
>been no difference at all (reported elapsed: 1106144 ms; my calculation
>says 18:47:413 or 1127413 ms - a 21 sec difference!). The cpu time
>reported by SET STATISTICS TIME is 939797 ms, so it looks as if there's
>little time lost while waiting.

What about the possibility of round-off error?

[snip]

Sincerely,

Gene Wirchenko

Computerese Irregular Verb Conjugation:
I have preferences.
You have biases.
He/She has prejudices.

Neo

unread,
May 26, 2004, 12:39:29 PM5/26/04
to
> If the requirements had even allowed two things of the same class to share
> the same name (e.g. two persons named 'john'), I would have asked for more
> information, because an RDBMS requires a possibility to identify each
> individual row in a table. (I don't know if XDb1 has such a requirement; I
> hope it does, because there is not much use in storing data about things
> if you have no way of knowing WHICH thing you're storing data about

XDb1 allows the user to store two persons with the same properties
(e.g., both named john) in the db. If the user doesn't want this to
happen, the burden falls upon the user to verify that a similar thing
does not already exist before creating a new one. XDb1's NLI, GUI and
API allow the user to do so. Below is pseudocode that allows the user
to limit entry to just one person named john.

// Find number of persons named john in db
cntr = 0;
while (pT = X("(person) john")) {
    cntr = cntr + 1;
}

// If no person named john exists, create one
if (cntr == 0) {
    pJohn = T_create();
    T_relate(pJohn, rCls, pPerson);
    T_Name_relate(pJohn, "john");
}
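For comparison, the equivalent guard on the relational side is commonly
written as a conditional insert. A minimal sketch using SQLite, with an
illustrative schema that is not the thread's actual one:

```python
import sqlite3

# Sketch of the same guard in an RDBMS: create a person named 'john' only
# if none exists yet. Schema is illustrative, not the thread's actual one.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, class TEXT, name TEXT)")

def create_person_once(name):
    # Conditional insert: the SELECT produces a row (and hence an insert)
    # only when no person with this name is already stored.
    con.execute(
        "INSERT INTO things (class, name) "
        "SELECT 'person', ? "
        "WHERE NOT EXISTS (SELECT 1 FROM things "
        "                  WHERE class = 'person' AND name = ?)",
        (name, name))

create_person_once("john")
create_person_once("john")   # second call is a no-op
count = con.execute(
    "SELECT COUNT(*) FROM things WHERE name = 'john'").fetchone()[0]
print(count)   # -> 1
```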

> if there are two persons named john in my office,
> what use is the statement 'john is 35' to me?)

If there are two or more johns in the db, XDb1's GUI responds 'john is
ambiguous, please select it'. Via code, the user can determine that
more than one john exists and can choose to modify none or all, or
prompt the user for further help.

Neo

unread,
May 26, 2004, 1:25:42 PM5/26/04
to
> Note however that things without name tend to be a PITA when
> you want to do anything useful with it, unless other attributes are stored
> that can be used to identify exactly which thing one wants to discuss.

XDb1 can accept two things whose attributes do not allow them to be
distinguished (e.g., two persons named john). Why? Because such
situations can come up in the real world.

> This incompleteness shows if you try to enter an unnamed person, an
> unnamed dog and an unnamed computer. Changing all 'john', 'fido' and
> 'laptop1' to '*' in the script is a good way to mess things up big time.

Yes, XDb1's NLI is limited and thus user would need to use the GUI or
API in the case of multiple unnamed things.

> While multiple unnamed things are indeed allowed, it becomes impossible to
> explain to XDb1 which of the unnamed things I'm referring to. As far as
> I'm concerned, this shows why it's generally better to insist on naming
> things.

I agree it is better to name things from a db/programmer's point of
view, but sometimes reality doesn't provide one. For example,
individual desks, chairs, pens, etc. usually don't have names. It is
possible to tell XDb1 exactly which unnamed thing is meant via the GUI
and API, but not the NLI.



> Of course, you already pointed out that the NLI is incomplete. I'm sure
> that you'll find a way to improve this. But please do note that I can
> change my SQL Server database to allow nameless things without having to
> contact the database vendor, whereas an XDb1 user is stuck with the current
> functionality until you find the time to fix this.

User can use XDb1's current GUI or API to handle multiple unnamed
things.

> Both implementations do not provide full support for things without name,
> things sharing a name or things having multiple names.

True but XDb1's data/schema does allow each thing to have 0 to many
names while the provided solution's data/schema does not.

Neo

unread,
May 26, 2004, 2:35:43 PM5/26/04
to
> > Another isssue with the provided implementation is that
> > the class hierarchy is not modelled faithfully.
>
> You obviously didn't check my model too well.

My apologies. Because I was trying to limit the complexity of
understanding the provided solution, I had removed tables (T_Types and
T_Attribs) from the script that were not being accessed to generate
the report. And when looking through the abridged script, I could only
conclude (wrongly) that the class hierarchy was simply encoded in the
table named things, since I didn't find it in the expected hierarchies
table.

However, after looking through the original script, I find the
following lack of genericness and normalization.

1. A minor (but crucial) point. The data in the types table does not
encode the class of force, church, person, dog and computer, which
should be thing. The likely reason is that there is (correctly) no
record for thing in the things table, and RM doesn't provide a way to
reference a table, only records in a table (XDb1 is different in this
respect: it can, in effect, reference a record or a table equally).
While some might say it is implicit that person isa thing, I would say
this is a failure to properly model the relationship between data. In
the provided solution, the method used to get the class of john does
not work to get the class of person. In XDb1,
T_Ref_Xth(pAnyThing, rCls) provides a systematic method to determine
the classes of any thing.

2. Because the procedure which creates the report utilizes the
hierarchies table (and not the types table), it is impossible to create
a nearest common ancestor report for the class hierarchy (without
adding redundant data). In XDb1, because all hierarchies are encoded
similarly, the procedure which creates the report handles the class
hierarchy the same as any other.

3. But more importantly and fundamentally, encoding the class
hierarchy in tables other than the hierarchies table is
redundant/unnormalized.
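The claim that one uniformly encoded hierarchy table lets a single
procedure report on any hierarchy can be sketched as follows; the edge
data, the hierarchy tag, and the function names are illustrative, not
XDb1's actual API:

```python
# Sketch: nearest common ancestor over ONE generic edge table that holds
# every hierarchy (class, leader, ...), keyed by a hierarchy tag. A node
# may have several parents, matching the challenge's requirements.
# Data and names are illustrative, not the thread's actual schema.
edges = {                        # (hierarchy, child) -> list of parents
    ("isa", "john"):   ["person"],
    ("isa", "mary"):   ["person"],
    ("isa", "fido"):   ["dog"],
    ("isa", "person"): ["thing"],
    ("isa", "dog"):    ["thing"],
}

def ancestors(hierarchy, node):
    """All ancestors of node with their distance, by breadth-first walk."""
    dist, frontier = {}, [(node, 0)]
    while frontier:
        cur, d = frontier.pop(0)
        for parent in edges.get((hierarchy, cur), []):
            if parent not in dist:
                dist[parent] = d + 1
                frontier.append((parent, d + 1))
    return dist

def nearest_common_ancestor(hierarchy, x, y):
    ax, ay = ancestors(hierarchy, x), ancestors(hierarchy, y)
    common = set(ax) & set(ay)
    if not common:
        return None
    # Nearest = smallest combined distance to the two things.
    return min(common, key=lambda a: ax[a] + ay[a])

print(nearest_common_ancestor("isa", "john", "mary"))  # -> person
print(nearest_common_ancestor("isa", "john", "fido"))  # -> thing
```

Because the class hierarchy here is stored under the same hierarchy tag
scheme as any other hierarchy, the same procedure reports on all of them.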

Neo

unread,
May 26, 2004, 2:48:55 PM5/26/04
to
> > In XDb1's db, 'john isa thing' is not stored because it can be derived.
> > Since 'isa' is transitive 'john isa thing' is derived
> > form 'john isa person' and 'person isa thing'.
>
> In my database, 'john isa thing' is not stored either, because it can be
> derived. Everything is a thing.

Show me a method to determine person's class with the provided
solution and I will show you how it will fail if the data is changed
(e.g., if the name of thing is changed from 'thing' to 'thingy' or
'object'). This failure will not occur with XDb1.

Neo

unread,
May 26, 2004, 5:09:45 PM5/26/04
to
> > Actually the provided solution is inconsistent in that the class
> > hierarchy of things ... isn't encoded in the hierarchies table.

> > It is instead encoded in a dedicated table named types.
>
> Yes, indeed. That's because I normalized the data (as requested in the
> original message). Note: see my other message as well, where I provide an
> additional table to keep track of the class hierarchy.

Even structure is data. The structure to store all hierarchies needs
to be normalized. In the provided solution, the leader hierarchy is
stored in the hierarchies table, as expected. But the class hierarchy
is stored in the types table (which makes that hierarchy inaccessible
to the procedure that generates the report). The fact that hierarchies
are stored in two different tables is redundant/unnormalized. For
example, in XDb1, if the user right-clicks the tree root node labeled
'thing' (which is the root of the class hierarchy), the nearest common
ancestor class report is generated.

> > The procedure that generates the report utilizes the hierarchies
> > table and not the types table.
>
> Of course. If you'd replace the statement 'john isa person' by 'john isa
> dog' but leave all 'xxx leader yy' statements intact, the Common Ancestor
> Report would remain unchanged.

I think you may be assuming I still want the nearest common ancestor
leader among things. I want the nearest common ancestor class among
things. How can I do that with the provided solution without the db
containing redundant data?

Neo

unread,
May 26, 2004, 5:31:46 PM5/26/04
to
> > ...the provided solution doesn't properly model ...the class hierarchy.
>
> Oh, so THAT is what you want! Why didn't you say so directly from the
> start? Because, you see, your original post failed to point this out.

Yes, the term ancestor was ambiguous in the sense that it was meant to
apply to any hierarchy (class/instance, parent/child, asm/part,
leader/follower, etc.).

> But since I'm in a good mood, I'll provide you with the necessary change
> free of charge (note that I changed the name of the types table to
> classes, if you didn't change "references classes" to "references types")
>
> CREATE TABLE ClassHierarchy (SubClass varchar(20) not null ...
> SuperClass varchar(20) not null ...
> primary key(SubClass, SuperClass))

I consider a solution that stores the class hierarchy in
T_ClassHierarchy (formerly T_Types) and the remaining hierarchies in
T_Hierarchies to be structurally redundant/unnormalized (yes, even
structure is data in my book). Why is one hierarchy being stored in a
different table than the rest?

Neo

unread,
May 26, 2004, 7:36:38 PM5/26/04
to
> > > Here are Date's formal definitions for the five normal forms ...
> >
> > The problem ... is that they are RM specific.
>
> Exactly how is that a "problem"?

The problem is, when I used the term normalized, I meant it in the
more general sense, which can be applied to any data model (replace
redundant data with refs, i.e. system-supported links that are
unrelated to data, such as IDs in RM, to the one and only original
within the db). But I can see how the term caused confusion.

> You didn't set a challenge to make a report using a model that is
> normalized according to XDb1's own definition of normalisation and then
> kludged in an RDBMS.

XDb1 doesn't define the general form of normalization. Neither does
RM. But the general form of normalization applies to all data models.
I want RM's solution to be normalized in the general sense of the word
(which apparently is a superset of the 5/6 "normal forms"). Even C. J.
Date states "the purpose of such reduction is to avoid redundancy" in
his chapter titled "Further Normalization I: 1NF, 2NF, 3NF, BCNF".

> Table hierarchies
> thing otherThing hierarchy
> ------ ---------- ---------
> 'fido' 'john' 'leader'
> 'fido' 'mary' 'leader'
> 'fido' 'luke' 'leader'
> My solution had no redundancy ... but my point should already be clear:
> there is no redundancy. You seem to be thinking that
> storing the character strings 'fido' and 'leader' several times is bad.

Redundancy is plainly obvious to me, so we must be working from
different definitions of normalization. Redundancy makes a solution
less generic and more prone to problems over a broader scope. For
instance, the above table's design prevents it from handling two
things with the same name or a thing with no name.

> for argument's sake assume ...
> That would allow me to remove all 'fido's and 'leader's from the
> hierarchies table. Instead, it would now look like this:
>
> Table hierarchies
> thingID otherThingID hierarchyID
> --------- ------------ -----------
> (snip some rows)
> 17 3 1
> 17 4 1
> 17 5 1
>
> And the table things would read:
> thingID thingName
> ------- ---------
> 3 'john'
> 4 'mary'
> 5 'luke'
> 17 'fido'

> But if I follow your suggestion of "improving", I end up with the
> same data in the existing tables and some extra relations that
> I could do without.

The above design is an improvement because it increases genericness.
Two things can now have the same name. Also, the above design is a
stepping stone to the next design, which would allow a thing to be
nameless without incurring NULLs.

The above design is also an improvement because it reduces redundant
data. The duplicate 17 in T_Hierarchies and T_Thing is not considered
redundant (with respect to dbs) because it is unrelated to the data
being modelled. Separating data from the mechanism that links things
within a db allows maximum flexibility under a broader scope (e.g.,
allowing unnamed things).
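The next design step described above, with names split out so a thing can
have zero, one, or many names and no NULLs appear, can be sketched like
this; the schema is illustrative:

```python
import sqlite3

# Sketch of the next design step: names live in their own table, so a thing
# may have zero, one, or many names with no NULLs, and duplicate names are
# allowed. Schema and IDs are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE things (id INTEGER PRIMARY KEY);
    CREATE TABLE thing_names (thing_id INTEGER REFERENCES things(id),
                              name TEXT NOT NULL);
    INSERT INTO things VALUES (3), (4), (17), (18);
    INSERT INTO thing_names VALUES (3, 'john'), (4, 'john'),  -- shared name
                                   (17, 'fido');              -- 18 is nameless
""")
# Two distinct things named 'john'; thing 18 exists with no name row at all.
johns = con.execute(
    "SELECT COUNT(*) FROM thing_names WHERE name = 'john'").fetchone()[0]
nameless = con.execute(
    "SELECT COUNT(*) FROM things t WHERE NOT EXISTS "
    "(SELECT 1 FROM thing_names n WHERE n.thing_id = t.id)").fetchone()[0]
print(johns, nameless)   # -> 2 1
```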

Neo

unread,
May 26, 2004, 8:18:36 PM5/26/04
to
> But what if I had bought XDb1 and actually *needed* this report
> - including the need to distinguish john the dog from john the person?
> The only way to accomplish this would be to contact you
> and ask if you could please make a patch to enable this for me.

The above is not entirely correct. The limitations of XDb1's NLI do
not prevent a user from completing the tasks via the GUI. And the user
can generate the same or a modified report by making calls to the API
via a programming language such as C. The code to generate the report
is not embedded in XDb1's engine. The code and calls to the API to
generate the report are shown at the website.

> So I immediately knew that you didn't mean "as generic as XDb1", since if

> you had you'd surely have provided full details on how generic XDb1 is.

Actually, I did mean "as generic as XDb1" in my mind, and it is true
that I didn't spell out all the details, since normalizing down to
atomic symbols has no meaning to most, and I was expecting an
iterative process to arrive at an RM implementation similar enough in
genericness and normalization to make a fair comparison. Apologies for
the confusion.

Lee Fesperman

unread,
May 27, 2004, 4:23:38 AM5/27/04
to
Neo wrote:
> The problem is, when I used the term normalized, I meant it in the
> more general sense, which can be applied to any data model (replace
> redundant data with refs, i.e. system-supported links that are
> unrelated to data, such as IDs in RM, to the one and only original
> within the db). But I can see how the term caused confusion.

I'm really tired of half-baked interpretations of the relational model by you and others
of your ilk (vendors of ad-hoc database systems).

> XDb1 doesn't define the general form of normalization. Neither does
> RM. But the general form of normalization applies to all data models.
> I want RM's solution to be normalized in the general sense of the word
> (which apparently is a superset of the 5/6 "normal forms"). Even C. J.
> Date states "the purpose of such reduction is to avoid redundancy" in
> his chapter titled "Further Normalization I: 1NF, 2NF, 3NF, BCNF".

There is no "general form" of normalization. Rather than educating yourself on the
issue, you take a summary of the concept (a single sentence in this case) and go off on
a ridiculous tangent.

Items like 'john' and 'o' (as in 'john', 'god', 'neo') are values. Non-redundant storage
of values is a physical issue. Normalization deals with the logical view of data.
Normalization is concerned with information or "facts", not storage of values.

> Redundancy is plainly obvious to me, so we must be working from
> different definitions of normalization. Redundancy makes a solution
> less generic and more prone to problems over a broader scope.

This makes no sense. You're just making this up as you go along. Much like your
ramblings about the human brain (which are OT in the comp.* branch).

Note: Your challenge is a farce. As I expected, you are just playing tricks to avoid
paying out the $1000.

--
Lee Fesperman, FirstSQL, Inc. (http://www.firstsql.com)
==============================================================
* The Ultimate DBMS is here!
* FirstSQL/J Object/Relational DBMS (http://www.firstsql.com)

Neo

unread,
May 27, 2004, 1:18:55 PM5/27/04
to
> > C. J. Date states "the purpose of such reduction is
> > to avoid redundancy" in his chapter titled "Further Normalization..."

>
> There is no "general form" of normalization.

If one analyzes the central theme/goal of normalization in RM (and
that in other data models), it is reasonable to extrapolate that the
general goal of normalization is eliminating redundancy, just as C.J.
Date did.

> Items like 'john' and 'o' (as in 'john', 'god', 'neo') are values.

> Normalization deals with the logical view of data.
> Normalization is concerned with information or "facts", not storage of values.

The problem with the above is that values are data.

> Non-redundant storage of values is a physical issue.

So how does this allow the provided solution to accommodate two things
with the same name, or a thing without a name, or a thing with multiple
names (without incurring NULLs)?

> > Redundancy is plainly obvious to me, so we must be working from
> > different definitions of normalization. Redundancy makes a solution
> > less generic and more prone to problems over a broader scope.
>
> This makes no sense.
> You're just making this up as you go along.

Please see other posts in this thread for examples of how redundancy
can make a solution less generic and more prone to problems when the
solution is extended to a broader scope.

Hugo Kornelis

unread,
May 27, 2004, 4:32:25 PM5/27/04
to
On 26 May 2004 16:36:38 -0700, Neo wrote:

Hi Neo,

>> > > Here are Date's formal definitions for the five normal forms ...


>> >
>> > The problem ... is that they are RM specific.
>>
>> Exactly how is that a "problem"?
>
>The problem is, when I used the term normalized, I meant it in the
>more general sense, which can be applied to any data model (replace
>redundant data with refs, i.e. system-supported links that are
>unrelated to data, such as IDs in RM, to the one and only original
>within the db). But I can
>see how the term caused confusion.

Well, when you set a challenge, it's not what you *think* that counts, but
what you *write*. And what you wrote didn't cause any confusion to me:

"(...) using the relational model (...) from normalized and
NULL-less data (...)"

I don't see how such a clear statement might cause confusion.


>> You didn't set a challenge to make a report using a model that is
>> normalized according to XDb1's own definition of normalisation and then
>> kludged in an RDBMS.
>
>XDb1 doesn't define the general form of normalization. Neither does
>RM. But the general form of normalization applies to all data models.
>I want RM's solution to be normalized in the general sense of the word
>(which apparently is a superset of the 5/6 "normal forms"). Even C. J.
>Date states "the purpose of such reduction is to avoid redundancy" in
>his chapter titled "Further Normalization I: 1NF, 2NF, 3NF, BCNF".

That is the purpose indeed. And boy, lots and lots of redundancy gets
removed when you take an unnormalized design and drill down through
all of Date's normal forms. So the method prescribed by Date seems to
fit the stated purpose well.

It would be more convincing if you could produce a quote from Date where
he says that "normalization is not complete until you can be assured that
each character of the alphabet gets stored at most one time, but pointers
may be stored as often as needed". Think you can find something like that
in Date's writings?


>> Table hierarchies
>> thing otherThing hierarchy
>> ------ ---------- ---------
>> 'fido' 'john' 'leader'
>> 'fido' 'mary' 'leader'
>> 'fido' 'luke' 'leader'
>> My solution had no redundancy ... but my point should already be clear:
>> there is no redundancy. You seem to be thinking that
>> storing the character strings 'fido' and 'leader' several times is bad.
>
>Redundancy is plainly obvious to me, so we must be working from
>different definitions of normalization.

That's what I've been trying to tell you all the time! I've been
working from the definition of normalization that is defined by Date
in the well-known normal forms, a definition that is accepted by all
experienced DBAs and programmers of RDBMS systems. You've been working
from a definition of normalization that you've made up for yourself
and, to put it bluntly, that is accepted by nobody but you yourself.

Which wouldn't have bothered me if you had taken the trouble of telling
the world, right from the off, in the message that started this thread,
that the challenge was for an implementation using an RDBMS but according
to Neo's personal definition of normalization. But you didn't. Your
message explicitly stated that to win the challenge, I had to be "using
the relational model". Date's definitions of normalization are part of the
relational model. Yours aren't. QED.

I already told you before: the relational model is about storing facts
(represented as relationships), not about storing names. Names are like
pointers (not to a memory location, but to something in the real world).

Redundancy means that you can take something away without changing the
meaning of the whole. I'll give you two facts that share one name; now
you show me how you can take something away from those two facts
without changing the meaning of the whole:

1) Neo owes Hugo $1000
2) The International Bank Account Number (IBAN) of Hugo's bank account
is NL59 RABO 0118 3365 68.


> Redundancy makes a solution
>less generic and more prone to problems over a broader scope.

Less generic? No. More prone to problems? Yes. But since there is no
redundancy in my implementation (neither in the first, nor in the second
version), this is irrelevant.


> For
>instance, the above table's design prevents it from handling two
>things with the same name or a thing with no name.

That can be accomplished by an RDBMS as well, but it was not a requirement
for the challenge. Making something more generic has a price. In an RDBMS,
that price is usually a performance hit. I've grown accustomed to making
my implementations exactly as generic as required. Not less (of course),
but neither more (since in the real world, performance DOES matter).

Now you may think that bad practice, but my customers seem to like
having applications that perform well while offering all the
flexibility they need over having an over-generic system that slowly
slugs along.

Oh no, now you've done it. That last sentence will probably set off the
next Holy War between the Natural Key Group vs the Artificial Key Group.

I've seen quite enough of them Holy Wars, thank you very much. I'm a
practical guy. While the Holy Followers Of The One And Only True Belief
are attempting to flame the Holy Followers Of The Other One And Only
True Belief to death in the newsgroups, practical people like me just
pick either a natural key or an artificial key, whichever suits them
best.

Oh and it might interest you that the defenders of the artificial key
never claim that an artificial key is "more relational" or "less
redundant" than a natural key - they generally defend their preferred
choice because it's more practical and quicker. The defenders of the
natural key *do* sometimes use arguments from relational theory to make
clear why an artificial key should not be used in an RDBMS. Try googling
one of the SQL Server newsgroups for "Joe Celko" +identity and you're
bound to find something.

Hugo Kornelis

unread,
May 27, 2004, 4:34:47 PM5/27/04
to
On Thu, 27 May 2004 08:23:38 GMT, Lee Fesperman wrote:

(snip)


>Note: Your challenge is a farce. As I expected, you are just playing tricks to avoid
>paying out the $1000.

It seems that getting Neo to live up to his word is going to be the *real*
challenge... :-)

Hugo Kornelis

unread,
May 27, 2004, 4:37:01 PM5/27/04
to
On 27 May 2004 10:18:55 -0700, Neo wrote:

>> > C. J. Date states "the purpose of such reduction is
>> > to avoid redundancy" in his chapter titled "Further Normalization..."
>>
>> There is no "general form" of normalization.
>
>If one analyzes the central theme/goal of normalization in RM (and
>that in other data models), it is reasonable to extrapolate that the
>general goal of normalization is eliminating redundancy, just as C.J.
>Date did.
>
>> Items like 'john' and 'o' (as in 'john', 'god', 'neo') are values.
>> Normalization deals with the logical view of data.
>> Normalization is concerned with information or "facts", not storage of values.
>
>The problem with the above is that values are data.

Data. Not facts.

Check what Lee writes: Normalization is concerned with information or
"facts".

Facts. Not data.

Hugo Kornelis

unread,
May 27, 2004, 4:50:55 PM5/27/04
to
On 26 May 2004 14:31:46 -0700, Neo wrote:

>> > ...the provided solution doesn't properly model ...the class hierarchy.
>>
>> Oh, so THAT is what you want! Why didn't you say so directly from the
>> start? Because, you see, your original post failed to point this out.
>
>Yes, the term ancestor was ambiguous in the sense that it was meant to
>apply to any hierarchy (class/instance, parent/child, asm/part,
>leader/follower, etc.).

There is no parent/child hierarchy in the challenge, nor any asm/part
hierarchy. There is a class/instance relationship, but the challenge did
not contain anything that might lead to the conclusion that this
relationship was hierarchic as well.


>> But since I'm in a good mood, I'll provide you with the necessary change
>> free of charge (note that I changed the name of the types table to
>> classes, if you didn't change "references classes" to "references types")
>>
>> CREATE TABLE ClassHierarchy (SubClass varchar(20) not null ...
>> SuperClass varchar(20) not null ...
>> primary key(SubClass, SuperClass))
>
>I consider a solution that stores the class hierarchy in
>T_ClassHierarchy (formerly T_Types) and the remaining hierarchies in
>T_Hierarchies to be structurally redundant/unnormalized (Yes, even
>structure is data in my book). Why is one hierarchy being stored in a
>different table than the remaining?

Because I tacked a quick fix onto an existing model to meet the extra
requirements you were starting to throw in. Because the original
requirements only mentioned things; things being instances of a class and
things forming a hierarchic "leader/follower" relationship.

You are right that this is a kludge. I should redo my model right from the
start. Replacing the things and the classes tables with one catch-all
table. And using the existing hierarchies table to store both the "leader"
relationships and the "isa" relationships as well. I wouldn't even have to
change the part of the code that actually generates the report!

On the other hand - if I do a fixed-price job for a customer and he starts
changing the requirements after the job is finished, I bill him for the
extra hours. Will you pay me extra if I rework my implementation as
described above? No need to pay my normal hourly rate - just say how much
you'll pay. Good chance that I'll bite.

Of course, I do still expect the $1000 you owe me for fulfilling all
requirements set forth in your original challenge message. I assume that
you've already paid and that delays caused by international banking issues
are the reason that I didn't receive anything yet?

Hugo Kornelis

unread,
May 27, 2004, 4:56:30 PM5/27/04
to
On 26 May 2004 14:09:45 -0700, Neo wrote:

>> > Actually the provided solution is inconsistent in that the class
>> > hierarchy of things ... isn't encoded in the hierarchies table.
>> > It is instead encoded in a dedicated table named types.
>>
>> Yes, indeed. That's because I normalized the data (as requested in the
>> original message). Note: see my other message as well, where I provide an
>> additional table to keep track of the class hierarchy.
>
>Even structure is data. The structure to store all hierarchies needs
>to be normalized. In the provided solution, the leader hierarchy is
>stored in the hierarchies table, as expected. But the class hierarchy
>is stored in types tables (which makes that hierarchy inaccessible to
>the procedure that generates the report). The fact that hierarchies
>are stored in two different tables is redundant/unnormalized.

See the message that I posted before this one.


> For
>example, in XDb1, if user right clicks the tree root node labeled
>'thing' (which is the root of the class hierarchy), the nearest common
>ancestor class report is generated.

Yes indeed. And it looks like this:

Common Ancestor Report for 'thing'
ThingX ThingY CmnAnc Dist
Time elapsed: 15 msec

Wow! I'm impressed. 15 msec to produce ... nothing!!


>> > The procedure that generates the report utilizes the hierarchies table
>> > and not the types table.
>>
>> Of course. If you'd replace the statement 'john isa person' by 'john isa
>> dog' but leave all 'xxx leader yy' statements intact, the Common Ancestor
>> Report would remain unchanged.
>
>I think, you may be assumming I still want the nearest common ancestor
>leader among things. I want the nearest common ancestor class among
>things. How can I do that with the provided solution without the db
>having redundant data?

See the message that I posted before this one.
