
The relational model and relational algebra - why did SQL become the industry standard?


Amund Trovåg

Feb 7, 2003, 7:34:13 AM
Hi all DB-theorists! :)

I am currently trying to work out why SQL, and not relational algebra, became
the industry standard in 1986 (I think it was). Is this because of the more
"technical" language of relational algebra and the difficulty of creating
universal queries, or is it something else?

I am a student in a master's program in information science, and I am trying
to see if there is a reason to do my dissertation on something that
involves making relational algebra more usable, e.g. by creating new
operators. Right now I am trying to figure out why relational algebra is a
dead language.

Thanks in advance for any help and pointers.

regards,
Amund Trovåg


Finarfin

Feb 7, 2003, 9:14:03 AM
Amund Trovåg wrote:


Hello:

I think you will find people who claim that relational algebra is
NOT dead. See "Foundation for Future Database Systems: The Third
Manifesto" by C.J. Date and Hugh Darwen, ISBN 0-201-70928-7,
published by Addison Wesley Longman, Inc. Also, there is a company
called Alphora with a product called Dataphor that claims to follow
the D language described in the book. You will see some posts in
this newsgroup related to this book as well. Search for "TTM" or
"Tutorial D", and also under the title of the book and the authors.
Also of interest could be http://www.dbdebunk.com/.

Hope that helps.

John Eggert

Anton Versteeg

Feb 10, 2003, 5:02:50 AM
There were some attempts to create RDBMSs that were closer to the relational
algebra. IBM had an RDBMS in the '80s called BS12 (Business System 12), whose
language was based directly on the relational algebra. See this article by Hugh Darwen:
http://www.mcjones.org/System_R/bs12.html

"Amund Trovåg" wrote:

--
Anton Versteeg
DB2 Specialist
IBM Netherlands


Alfredo Novoa

Feb 10, 2003, 8:44:18 AM
"Amund Trovåg" <am...@texassibir.com> wrote in message news:<b208pe$27s0$1...@toralf.uib.no>...

> Hi all DB-theorists! :)
>
> I am currently trying to work out why SQL and not relational algebra became
> the industry standard in 1986(I think it was). Is this because of the more
> "technical" language of rel. algebra and the difficulty in creating
> universal queries, or is it something else?

I don't think it was due to strong technical reasons.

SQL was designed by IBM's labs, and IBM had, and still has, a lot of power
in the IT industry.

SQL is the VHS of database languages. Quel could be the Betamax
(better, but dead).

I recommend this little book to you:

THE DATABASE RELATIONAL MODEL:
A RETROSPECTIVE REVIEW AND ANALYSIS
(ISBN: 0201612941)

Codd preferred the relational calculus, rather than the relational
algebra, as the basis of relational languages.

This could be one of the reasons why most of the first relational
languages were based on the relational calculus.

By the way, Date has doubts about the strength of Codd's arguments.

Another reason could be that SQL's designers thought that relational
algebra and calculus looked too mathematical for the average
practitioner, and they tried to make something different. But when
they saw that the results were not very good, they reintroduced
some features from the relational calculus.

SQL started as a simplified lab prototype, and its very poor scalar
type support was probably due to that.

> I am a student at a master's program in information science, and I am trying
> to see if there is a reason for doing my dissertation on something that
> involves making relational algebra more usable, by e.g. making new
> operators. Right now I am trying to figure out why relational algebra is a
> dead language.

As Finarfin said, the example language of The Third Manifesto is based
on the relational algebra.

www.thethirdmanifesto.com

IMO the relational algebra is more or less as usable as the relational
calculus. Some expressions are easier in the relational algebra, and
some others are easier in the relational calculus.
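For what it's worth, the contrast can be sketched in a few lines of Python. This is a toy illustration, not from any real system; the relation and attribute names are invented:

```python
# Algebra style: compose named operators over sets of tuples.
# Calculus style: one comprehension stating a membership condition.

emp = {("Smith", "Sales"), ("Jones", "Sales"), ("Blake", "Ops")}

def select(rel, pred):
    """Restriction: keep the tuples satisfying the predicate."""
    return {t for t in rel if pred(t)}

def project(rel, *idx):
    """Projection: keep the given attribute positions (sets dedup for free)."""
    return {tuple(t[i] for i in idx) for t in rel}

# Algebra: project(select(emp, dept = 'Sales'), name)
algebra_result = project(select(emp, lambda t: t[1] == "Sales"), 0)

# Calculus: { name | (name, dept) in emp and dept = 'Sales' }
calculus_result = {(name,) for (name, dept) in emp if dept == "Sales"}

assert algebra_result == calculus_result == {("Smith",), ("Jones",)}
```

For this query the two styles are equally short; the differences show up in more involved queries, where one formulation or the other reads more naturally.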

> Thanks in advance for any help and pointers.

Good luck.


Regards
Alfredo

David Cressey

Feb 10, 2003, 12:54:59 PM
Amund,

Your question took me back to 1986. What follows is a personal retrospective,
not history. You'll have to do the digging through the sources yourself.

A couple of years earlier, I had learned a relational DBMS, Rdb/VMS, which
ran on the DEC VAX. It had an interface language of its own, called RDO,
that vaguely resembled another language, Datatrieve, that DEC had sold
earlier, originally on the PDP-11.

When Rdb/VMS version 3 came out, in about 1986, it featured a "new"
interface, SQL. At first it was "in addition" to RDO. Then in later
versions of Rdb/VMS, RDO was declared more or less "stabilized", and SQL
eventually became the principal interface language.

Why?

My recollection is that it was to compete with IBM. DB2 used SQL, and it
was easier to sell Rdb if the people who already knew SQL didn't have to
learn something else. This is the way most "de facto" standards come into
being.

I hope you'll be able to confirm or correct this little bit of recollection,
and to find out how some other products came to use SQL as the interface
language.


Lauri Pietarinen

Feb 10, 2003, 2:14:16 PM
> I am currently trying to work out why SQL and not relational algebra became
> the industry standard in 1986(I think it was). Is this because of the more
> "technical" language of rel. algebra and the difficulty in creating
> universal queries, or is it something else?

Good question!

It is probably related to many things, among them:
- the supremacy of IBM during the '70s and '80s
- fear of having to remove duplicates,
because of (alleged) potential performance implications -
hence SQL's pseudo-tuple-bag model
(see http://www.dbdebunk.com/cjddtdt.htm)
- basically, the fact that the relational model
as presented by Codd in 1970 was such a
radical departure from the traditional way
of thinking that a watered-down version was
(and still is!) the only version that the
public can stomach

The product mentioned in another posting in this thread is
called Dataphor; more information can be found at
www.alphora.com.

There is a web site dedicated to The Third Manifesto
(www.thethirdmanifesto.com). It does not have
as much information as the book but it is a good start.

I see that Hugh Darwen has made his presentation
"The Askew Wall" available on the site.

regards,
Lauri Pietarinen

Amund Trovåg

Feb 11, 2003, 5:09:20 AM

"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message
news:e9d83568.03021...@posting.google.com...

Thank you very much, Mr Pietarinen, and everyone else replying to this
thread.

After reading quite a lot about this very interesting subject, which I didn't
know was as controversial as it seems to be, I'd like to extend my question
a bit further.

First and foremost, it seems that C.J. Date and Pascal are the most vocal
critics of the SQL language, and they seem to make a whole host of
valid arguments on the website Database Debunkings (www.dbdebunk.com). Are
these arguments considered correct and valid by DB theorists in general, or
is this a two-man-crusade type of thing?
The article "Double Trouble, Double Trouble" Part I by
Date (http://www.dbdebunk.com/cjddtdt.htm) is especially interesting, as it
concerns the inclusion of duplicate tuples (rows) in SQL-based DBs. I have
also heard of, although never practised, optimizing a DB with duplicates.
Are there arguments FOR such a strategy? If not, as Date claims,
why would people create duplicates for optimization?
Ignorance, or deeper knowledge of the shortcomings of SQL?

Best regards
Amund Trovåg


Jan Hidders

Feb 11, 2003, 5:58:09 AM
Amund Trovåg wrote:
>
>First and foremost it seems that CJ Date and Pascal are the most vocal of
>the critcs of the SQL language, and they seem to be making a whole host of
>valid arguments of the website Database Debunkings(www.dbdebunk.com). Are
>these arguments considered correct and valid by DB-theorists in general, or
>is this a two-man crusade type of thing?

From personal experience I would say that they are widely considered correct
by DB theorists, even by those who don't care much for Date's other writings.

-- Jan Hidders

Lauri Pietarinen

Feb 12, 2003, 4:28:59 PM
Jan,

what is your take on Garcia-Molina, Ullman and Jennifer Widom
regarding their stand on duplicates?

(see http://www.dbdebunk.com/cjddtdt.htm and cjddtdt2)

Are they just realists accepting that SQL is the de facto
standard, so one might as well take it as a basis?

Or do they actually think that duplicates are OK
and cause no harm?

Lauri

Jan Hidders

Feb 12, 2003, 6:03:20 PM
Lauri Pietarinen wrote:
>
>what is your take on Garcia-Molina, Ullman and Jennifer Widom
>regarding their stand on duplicates?
>
>(see http://www.dbdebunk.com/cjddtdt.htm and cjddtdt2)
>
>Are they just realists accepting that SQL is the de facto
>standard, so one might as well take it as a basis?

There are two separate questions here:
1. Do we want duplicates in the data model, i.e., in the original relations
and the results of queries?
2. Do we want duplicates in intermediate results?

I'm not completely sure what their answer to 1 is, but I suspect it is
something like "probably not". But how your algebra looks depends on how you
answer question 2, because query optimization is the main raison d'être of
the algebra, and there it is a completely different story. It can, for
example, be more efficient to postpone duplicate elimination. If you don't
have a bag algebra, you cannot express this in your algebra.
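The postponement described above can be sketched in Python, using Counter as a stand-in for a bag. The input data is invented; the point is only that both plans agree on the final result, while the bag plan defers all duplicate-elimination work to one step at the end:

```python
from collections import Counter

# Two hypothetical input relations, as lists of rows with duplicates.
r = [1, 1, 2, 3]
s = [2, 3, 3, 4]

# Plan A (set algebra): eliminate duplicates at every step.
early = set(r) | set(s)

# Plan B (bag algebra): carry multiplicities through the intermediate
# result (a UNION ALL), and deduplicate only once, at the very end.
intermediate = Counter(r) + Counter(s)   # bag union: no dedup work yet
late = set(intermediate)                 # one final duplicate elimination

assert early == late == {1, 2, 3, 4}
```

A pure set algebra has no way to even write down the intermediate bag, which is why the optimizer's internal algebra is usually bag-based.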

Note that in the writings you mention, Date only addresses the first
question, whereas what you actually asked concerned mostly the second
question.

-- Jan Hidders


Lauri Pietarinen

Feb 13, 2003, 6:27:51 AM
>
>
>There are two separate questions here:
>1. Do we want duplicates in the data model, i.e., in the original relations
> and the results of queries?
>2. Do we want duplicates in intermediate results?
>
>I'm not completely sure what their answer to 1. is but I suspect it is
>something like "probably not".
>

That's not the impression I get from Date's article (see part II), but I
have not read the book.

< quotes from book Hector Garcia-Molina, Jeffrey D. Ullman, and
Jennifer Widom, DATABASE SYSTEM IMPLEMENTATION>

[Relational] algebra was originally defined as if relations were sets
[sic!--italics added]. Yet relations in SQL are really bags ... Thus, we
shall introduce relational algebra as an algebra on bags.

...

For instance, you may have learned set-theoretic laws such as A
INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is
formally the "distributive law of intersection over union." This law
holds for sets, but not for bags.

< quotes from book/ >

This does not look like it is dealing with intermediate results only...
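The quoted failure of the distributive law is easy to check concretely. A sketch using Python's Counter as a bag, with union as multiplicity addition and intersection as multiplicity minimum (the textbook definitions):

```python
from collections import Counter

def bag_union(x, y):
    return x + y    # Counter '+' sums multiplicities (UNION ALL)

def bag_intersect(x, y):
    return x & y    # Counter '&' takes the per-element minimum

# One copy of the same tuple t in each of A, B, C.
A = Counter(["t"])
B = Counter(["t"])
C = Counter(["t"])

lhs = bag_intersect(A, bag_union(B, C))                    # min(1, 1+1) = 1 copy
rhs = bag_union(bag_intersect(A, B), bag_intersect(A, C))  # 1 + 1 = 2 copies

assert lhs["t"] == 1 and rhs["t"] == 2   # the distributive law fails for bags
```

For sets (all multiplicities capped at 1) the two sides coincide, which is exactly the law the book says you may have learned.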

> But how your algebra looks depends on how you
>answer question 2, because query optimization is the main raison d'etre of
>the algebra, and there it is a completely different story. It can for
>example be more efficient to postpone duplicate elimination. If you don't
>have a bag algebra you cannot express this in your algebra.
>

I don't think that anybody is suggesting that intermediate results need
to have duplicates removed. It's the end result that counts. E.g. in the
following code fragment

int i;
i = 5;
i = 6;
System.out.println(i);

an optimizing compiler would not bother to set i to 5, because nobody is
interested in that intermediate value. In the same spirit, intermediate
values of relational expressions are of interest only to the system
internally.

>Note that in the writings you mention Date only addresses the first
>question, where what you actually asked concerned mostly the second
>question.
>

See above...

regards,
Lauri Pietarinen


--
________________________________________________________________

Lauri Pietarinen, Senior Consultant, Databases

AtBusiness Communications Oyj, Kaapeliaukio 1, FIN-00180 Helsinki

tel. +358-9-2311 6632, mob. +358-50-594 2011, fax +358-9-2311 6601
http://www.atbusiness.com, email: lauri.pi...@atbusiness.com
_____________________________________________________________________


Paul Vernon

Feb 13, 2003, 5:54:30 AM
"Jan Hidders" <hid...@REMOVE.THIS.ua.ac.be> wrote in message
news:3e4ad...@news.ruca.ua.ac.be...

> Lauri Pietarinen wrote:
> >
> >what is your take on Garcia-Molina, Ullman and Jennifer Widom
> >regarding their stand on duplicates?
> >
> >(see http://www.dbdebunk.com/cjddtdt.htm and cjddtdt2)
> >
> >Are they just realists accepting that SQL is the de facto
> >standard, so one might as well take it as a basis?
>
> There are two separate questions here:
> 1. Do we want duplicates in the data model, i.e., in the original relations
> and the results of queries?
> 2. Do we want duplicates in intermediate results?
>
> I'm not completely sure what their answer to 1. is but I suspect it is
> something like "probably not". But how your algebra looks depends on how you
> answer question 2, because query optimization is the main raison d'etre of
> the algebra, and there it is a completely different story. It can for
> example be more efficient to postpone duplicate elimination. If you don't
> have a bag algebra you cannot express this in your algebra.

In other words, this is a matter of possibly having two algebras: one for
database users that is compliant with the relational model and closure (i.e.
no duplicates), and possibly an extended algebra used internally by an
implementation to allow the postponement of duplicate elimination in
intermediate results.

That is a useful distinction to make, but not one I've really picked up on
before, so thanks Jan.

> Note that in the writings you mention Date only addresses the first
> question, where what you actually asked concerned mostly the second
> question.

I guess Date would say that the second question is an implementation issue
and so not something of great concern to him. I've certainly seen mentions of
delaying duplicate elimination in his writings, but only in passing.

Regards
Paul Vernon
Business Intelligence, IBM Global Services


Anton Versteeg

Feb 13, 2003, 7:36:20 AM

Lauri Pietarinen wrote:

> I don't think that anybody is suggesting that intermediate results need
> to remove duplicates. It's
> the end result that counts. E.g. in the following code fragment
>
>

Well, I think that duplicates can be useful even in end results.
It is the difference between set theory and query results in practice.
For instance: a set doesn't have an order, but it would be impossible to
present results to a user of our database if we couldn't order the end result.

To give an example of the use of duplicates:
suppose we have a table that holds text (letters, for instance).
We would probably have a line number field and a text field.
To improve readability we will have several occurrences of blank lines.
If we then select the text column ordered by the line number, we will have
(meaningful) duplicates in the end result.
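A runnable sketch of this letter-text example, using SQLite from Python (the table and column names are invented for the illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE letter (lineno INTEGER PRIMARY KEY, txt TEXT)")
con.executemany("INSERT INTO letter VALUES (?, ?)",
                [(1, "Dear Sir,"), (2, ""), (3, ""), (4, "Thank you.")])

# Selecting only the text column, ordered by line number, yields a result
# with duplicate rows: the two meaningful blank lines.
rows = [r[0] for r in con.execute("SELECT txt FROM letter ORDER BY lineno")]
assert rows == ["Dear Sir,", "", "", "Thank you."]

# Under set semantics (SELECT DISTINCT) one of the blank lines would be lost.
distinct = [r[0] for r in con.execute("SELECT DISTINCT txt FROM letter")]
assert sorted(distinct) == ["", "Dear Sir,", "Thank you."]
```

Note the counter-argument made later in the thread: the query that carries the line number along (and merely hides that column at display time) has no duplicate rows at all.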

>
> Lauri Pietarinen, Senior Consultant, Databases
>
> AtBusiness Communications Oyj, Kaapeliaukio 1, FIN-00180 Helsinki
>
> tel. +358-9-2311 6632, mob. +358-50-594 2011, fax +358-9-2311 6601
> http://www.atbusiness.com, email: lauri.pi...@atbusiness.com
> _____________________________________________________________________

--

Lauri Pietarinen

Feb 13, 2003, 8:27:42 AM
>
>
>The book is agnostic on this issue. Its starting point is that you want to
>build an SQL database (and so there are duplicates in your results) and then
>explains how you should do that.
>

OK, thanks.

So the answer to my question is
"they have taken SQL as a basis _in_this_book_ but
probably agree otherwise that duplicates should be
avoided and better optimisations would be obtained
without duplicates"

regards,
Lauri Pietarinen

Jan Hidders

Feb 13, 2003, 8:17:44 AM
Lauri Pietarinen wrote:

>Jan Hidders wrote:
>>
>>There are two separate questions here:
>>1. Do we want duplicates in the data model, i.e., in the original relations
>> and the results of queries?
>>2. Do we want duplicates in intermediate results?
>>
>>I'm not completely sure what their answer to 1. is but I suspect it is
>>something like "probably not".
>
>That's not the impression I get from Date's article (see part II), but I
>have not read the book.

The book is agnostic on this issue. Its starting point is that you want to
build an SQL database (and so there are duplicates in your results), and it
then explains how you should do that.

-- Jan Hidders

Lauri Pietarinen

Feb 13, 2003, 8:32:09 AM
>
>
>Well, I think that even for end results duplicates can be useful.
>It is the difference between the set theory and query results in practice.
>For instance: a set doesn't have an order but it would be impossible to present
>results to a user of our database if we cannot order the end result.
>

I agree that ordering is needed to present data, and that such a "set" of
ordered rows is not a relation anymore.

>To give an example of the use of duplicates:
>Suupose we have a table that holds text (letters for instance).
>We would probably have a line number field and a text field.
>To improve readability we will have several occurrences of blank lines.
>If we then select the text column ordered by the line number, we will have
>(meaningful) duplicates in the end result.
>

However, you would include the line number column only to get the rows
in the correct order, and you would disregard that column in your program
or report generator.

So the end result of the query would _not_ have duplicates!

regards,
Lauri Pietarinen

Alfredo Novoa

Feb 13, 2003, 8:52:04 AM
Hi

hid...@REMOVE.THIS.ua.ac.be (Jan Hidders) wrote in message news:<3e4ad...@news.ruca.ua.ac.be>...

> There are two separate questions here:
> 1. Do we want duplicates in the data model, i.e., in the original relations
> and the results of queries?

I only want to add that duplicates also ruin updateable views.

http://www.pgro.uk7.net/views3_0110.htm

Regards,
Alfredo

Paul Vernon

Feb 13, 2003, 8:51:40 AM
"Anton Versteeg" <anton_v...@nnll.iibbmm.com> wrote in message
news:3E4B9144...@nnll.iibbmm.com...

>
>
> Lauri Pietarinen wrote:
>
> > I don't think that anybody is suggesting that intermediate results need
> > to remove duplicates. It's
> > the end result that counts. E.g. in the following code fragment
> >
> >
>
> Well, I think that even for end results duplicates can be useful.

If you define 'end results' not as the result of a relational query, but as
the result of some extra, non-relational transformation of a relation, then OK;
but please don't try to argue that a database is best supported by a 'bag'
algebra (or an array algebra, or a network algebra, ...) rather than a
relational algebra.

> It is the difference between the set theory and query results in practice.

No. It is not a case of theory not matching practice (which, if true, would
tell us that the theory is broken); rather, it is a lack of clear
understanding of where relational algebra ends and something else (like
array variables and values) picks up.

Of course theorists are sometimes guilty of overstating the scope of a
theory, but much more usually it is the practitioners who try to apply
a given theory outside its bounds.


> For instance: a set doesn't have an order but it would be impossible to
> present results to a user of our database if we cannot order the end result.

We live in a four-dimensional world, and there is order everywhere: up/down,
left/right, before/after. Because of this we indeed cannot ever see or hear or
feel or otherwise sense a relation per se. The best we can do is to logically
transform a relation into, say, an array, and then present that array using
some visual display unit, like a computer screen. Things like arrays, trees,
lists etc. are all slightly closer to being able to be sensed than relations
(although I'm not sure one can really see an array either; all we see is
light...).

> To give an example of the use of duplicates:
> Suupose we have a table that holds text (letters for instance).
> We would probably have a line number field and a text field.
> To improve readability we will have several occurrences of blank lines.
> If we then select the text column ordered by the line number, we will have
> (meaningful) duplicates in the end result.

OK, but just be clear that such an 'end result' is not a relation, and so any
syntax that helps produce such 'end results' is not part of a relational algebra.
A logically clean syntax would be to 'cast' a relational query result to, say,
an array, and then further 'select' certain array elements for display in some
specified (or unspecified) order.

Unfortunately, SQL does not make such a clean separation. Regardless of its
being a bag algebra (of sorts), it lets ORDER BY operate on its bags without
any explicit cast to an orderable non-scalar type such as an array.

Jan Hidders

Feb 13, 2003, 11:39:07 AM
Lauri Pietarinen wrote:
>>
>>The book is agnostic on this issue. Its starting point is that you want to
>>build an SQL database (and so there are duplicates in your results) and then
>>explains how you should do that.
>
>OK, thanks.
>
>So the answer to my question is
>"they have taken SQL as a basis _in_this_book_

Yes.

>but probably agree otherwise that duplicates should be avoided

Yes.

>and better optimisations would be obtained without duplicates"

No. In fact, in theory, all optimizations that can be done in a set-based
algebra can also be done in a bag-based algebra but not the other way
around.

-- Jan Hidders


Lauri Pietarinen

Feb 13, 2003, 12:33:52 PM
Yes Paul,

this is more or less what I meant to say, and would
have said, had I had the words for it...

regards
Lauri Pietarinen

Lauri Pietarinen

Feb 13, 2003, 12:36:49 PM
>
>
>>and better optimisations would be obtained without duplicates"
>>
>>
>
>No. In fact, in theory, all optimizations that can be done in a set-based
>algebra can also be done in a bag-based algebra but not the other way
>around.
>

So you in fact disagree with Date on that one?

Is optimisation not (at least partly) a question of query transformations?
I understand that more transformations are available when we operate with
sets than when we operate with bags. Even the excerpt from the book
seems to suggest this:

<quote>


For instance, you may have learned set-theoretic laws such as A
INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is
formally the "distributive law of intersection over union." This law
holds for sets, but not for bags.

<quote/>

regards,
Lauri Pietarinen


Paul Vernon

Feb 13, 2003, 12:34:19 PM
"Jan Hidders" <jan.h...@REMOVE.THIS.ua.ac.be> wrote in message
news:3e4bc...@news.ruca.ua.ac.be...
[snip]

> >and better optimisations would be obtained without duplicates"
>
> No. In fact, in theory, all optimizations that can be done in a set-based
> algebra can also be done in a bag-based algebra but not the other way
> around.

Obviously, I guess: a bag algebra is a superset of a set algebra.

However, it does not follow that a DBMS whose users were exposed to a bag-based
algebra would be overall more efficient than one whose users were 'restricted'
to a set-based algebra. Not by a long shot.

Mikito Harakiri

Feb 13, 2003, 3:07:02 PM
Suppose we have a set {0,1}. Let's move the "second" set member a little
bit: {0,1/2}. So far so good; the set still has 2 elements. Let's move it
one more time: {0,1/4}. We still have cardinality 2. In the limit,
however, we have a set with only one element: {0}. This change of
cardinality is very nasty from a math perspective, because it represents a
break of continuity. This is why mathematicians would consider the limit to
be the multiset {0,0} rather than just the set {0}. Mathematical definitions
do contain the multiset concept explicitly, for example:
"a spectrum is a *multiset* of eigenvalues". The break of continuity manifests
itself in relational theory as a change of row-set cardinality after a
projection is applied. In the multiset model, by contrast, projection
doesn't change the number of rows. (Selection -- being dual to projection --
doesn't change the number of columns in either the set or the bag model.)

In short, logic and set theory together are only a tiny part of mathematics.
Don't evangelize them.

"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message

news:3E4B9E59...@atbusiness.com...

Mikito Harakiri

Feb 13, 2003, 3:37:11 PM
"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message
news:3E4B8137...@atbusiness.com...

> < quotes from book Hector Garcia-Molina, Jeffrey D. Ullman, and
> Jennifer Widom, DATABASE SYSTEM IMPLEMENTATION>
>
> [Relational] algebra was originally defined as if relations were sets
> [sic!--italics added].Yet relations in SQL are really bags ... Thus, we
> shall introduce relational algebra as an algebra on bags.
>
> ...
>
> For instance, you may have learned set-theoretic laws such as A
> INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is
> formally the "distributive law of intersection over union." This law
> holds for sets, but not for bags.
>
> < quotes from book/ >

Therefore, the idea here is that set algebra is superior to bag algebra? Not
for aggregates:

PROJECTION*AGGREGATE != AGGREGATE*PROJECTION

for example

select distinct S from (
select SUM(SAL) as S from emp
)

is not the same as

select distinct SUM(SAL) from (
select distinct SAL from emp
)

where I'm using SQL with the "distinct" keyword merely as a surrogate for
true relational syntax ("distinct" is redundant after an aggregate operation,
of course).
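The non-commutation above can be checked with a few rows of invented salary data, using SQLite from Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?)", [(10,), (10,), (20,)])

# Aggregate first, then project (deduplicate) the single result.
sum_then_project = con.execute(
    "SELECT DISTINCT s FROM (SELECT SUM(sal) AS s FROM emp)").fetchone()[0]

# Project (deduplicate) the salaries first, then aggregate.
project_then_sum = con.execute(
    "SELECT SUM(s) FROM (SELECT DISTINCT sal AS s FROM emp)").fetchone()[0]

assert sum_then_project == 40   # 10 + 10 + 20: duplicates counted
assert project_then_sum == 30   # 10 + 20: duplicates removed first
```

The two answers differ exactly because SUM is sensitive to multiplicities, which is the strongest practical argument for keeping bags visible somewhere in the model.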


Mikito Harakiri

Feb 13, 2003, 3:47:40 PM

"Alfredo Novoa" <alf...@ncs.es> wrote in message
news:e4330f45.03021...@posting.google.com...

> I only want to add that duplicates also ruin updateable views.
>
> http://www.pgro.uk7.net/views3_0110.htm

Why? View updating is about the ability to solve equations in relational
operators. While one can certainly claim that set algebra is superior to
bag algebra, you have to demonstrate what exact technical difficulties arise
in solving equations in the bag algebra. Hand-waving doesn't count.


Mikito Harakiri

Feb 13, 2003, 5:01:11 PM
"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message
news:3E4B8137...@atbusiness.com...

> For instance, you may have learned set-theoretic laws such as A
> INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is
> formally the "distributive law of intersection over union." This law
> holds for sets, but not for bags.

I wonder why the row multiplicities of A INTERSECT B are not multiplied in
the bag definition of INTERSECT. Well,

{<Smith, 20>, <Smith, 20>} INTERSECT {<Smith, 20>, <Smith, 20>, <Smith, 20>} =
= {<Smith, 20>, <Smith, 20>, <Smith, 20>, <Smith, 20>, <Smith, 20>, <Smith, 20>}

might seem counterintuitive, but we don't have to appeal to the layman's
common sense, right? Just make the definition consistent with the algebraic
rules, and adjust "common sense" to it. Or am I missing some pitfall?
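One pitfall, for what it's worth: under the product definition, intersection stops being idempotent (A INTERSECT A is no longer A), whereas the usual min definition preserves that law. A small Python sketch with Counter as the bag:

```python
from collections import Counter

# Two copies of the tuple <Smith, 20>.
A = Counter({("Smith", 20): 2})

# Product definition: multiply the multiplicities.
product_def = Counter({t: A[t] * A[t] for t in A})   # 2 * 2 = 4 copies

# Usual min definition (Counter '&' takes the per-element minimum).
min_def = A & A                                      # min(2, 2) = 2 copies

assert product_def[("Smith", 20)] == 4   # A INTERSECT A != A: idempotence lost
assert min_def == A                      # min keeps A INTERSECT A = A
```

So the min definition is the one consistent with the algebraic rule that intersecting a relation with itself changes nothing.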


Lauri Pietarinen

Feb 13, 2003, 5:37:18 PM
>
>
>In short, logic and set theory together are only a tiny part of mathematics.
>Don't evangelize them.
>
Well, not being a mathematician, but having studied some mathematics,
I was and still am under the impression that all modern mathematics
is based on set theory, which is equivalent to logic.

regards,
Lauri Pietarinen

Lauri Pietarinen

Feb 13, 2003, 5:45:04 PM
> In short, logic and set theory together are only a tiny part of mathematics.
> Don't evangelize them.

Well, not being a mathematician, but having studied some mathematics,
I was and still am under the impression that all modern mathematics
is based on set theory, which is equivalent to logic.

Here is a quote from Wikipedia
(http://www.wikipedia.org/wiki/Foundations_of_mathematics)

<quote>

The current dominant mathematical paradigm is based on axiomatic set theory and formal logic. Virtually all mathematical theorems today can be formulated as theorems of set theory. The truth of a mathematical statement, in this view, is then nothing but the claim that the statement can be derived from the axioms of set theory using the rules of formal logic.

<quote/>


Steve Kass

Feb 13, 2003, 6:21:30 PM
Mikito,

It's pointless to talk about the limit of a set S without setting forth
some context. If you are talking about sets, the limit of {0,x} as the
real number x goes to zero is {0}. Why is the change of cardinality
"nasty"? There is no mathematical reason to think that the
cardinality function must be continuous.

You could, if you chose, define the limit of a multiset, and you would
have a different discussion. If you defined things in the most natural way,
the limit of {0,x} as x goes to zero would be {0,0}. Discussing continuity
is pointless without context. If f is a function from the collection of
multisets of reals to the collection of multisets of reals, then you need a
topology on multisets of reals to be able to discuss continuity (relative to
that topology). Without thinking it all through, I suspect a good topology
is the smallest one containing all open intervals (a,b) (these being
sub_sets_ of reals).

To make the limit {0} make sense for multisets, you'd need every open set
around {0} to contain {e, e} for some e <> 0, and probably for it to
contain {e,e,e,e,...} as well. Maybe it has nice properties, too. But the
{0} limit is exactly the correct one for sets, of course, under the usual
topology on the real numbers. Change of cardinality on projection is nothing
at all difficult: project a line onto a plane in the direction of the line,
and you've gone from an uncountable set to a finite one. Not changing
cardinality is more trouble.

Be careful not to make claims about good and bad properties of operations
on multisets (or sets) without specifying the context in which you are
making these statements.

Steve Kass
Drew University

Mikito Harakiri

Feb 13, 2003, 6:23:48 PM
"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message news:3E4C1FF0...@atbusiness.com...
> > In short, logic and set theory together are only a tiny part of mathematics.
> > Don't evangelize them.
>
> Well, not being a mathematician,
> but having studied some mathematics I was and
> still am under the impression that all modern mathematics
> is based on set theory which is equivalent with logic.
Set reductionism is a disease that I suffered from as a student. Although sets and logic certainly spice mathematics with rigour, they are mostly irrelevant for practising mathematicians not working in those fields.
> Here is a quote from Wikipedia
> (http://www.wikipedia.org/wiki/Foundations_of_mathematics)
>
> <quote>
>
> The current dominant mathematical paradigm is based on axiomatic set theory and formal logic. Virtually all mathematical theorems today can be formulated as theorems of set theory. The truth of a mathematical statement, in this view, is then nothing but the claim that the statement can be derived from the axioms of set theory using the rules of formal logic.
>
> <quote/>

There is a difference between "can" and "have to". Reductionism to sets might
be unnatural, impractical, etc. The famous example is the ordered-pair
definition

(a, b) := {{a}, {a, b}}

AFAIK, no mathematical theorem is reformulated to use the above set-based
definition instead of ordered pairs. The ordered tuple is widely considered
to be a basic concept (including in relational theory ;-).

Steve Kass

unread,
Feb 13, 2003, 6:28:12 PM2/13/03
to
I didn't read the whole thread, but in the right context, the
bags are functions, and intersect, union, etc. are all well-defined.

Definition: A multiset M over a domain U is a function with
domain U and range the set of non-negative integers.

We say that the element x is in M if M(x) > 0.
We say that the element x appears in M with multiplicity k, if M(x) = k
We say that M = M' for two multisets M and M' on domains U and U'
respectively if M(x) = M'(x) for all x in the intersection of U and U' and
M(x) = 0 when x is not in U, and M'(x) = 0 when x is not in U'.

Functions are well-defined in terms of sets also, subsets of U x Z where
(x,n) and (x,m) both in a function f implies n = m (single-valuedness).

It makes no sense to take the union of two functions, but it does make sense
to take the sum (union all), min (intersection), and other operations.

Just define things carefully and it's all much easier. If you think the
intersection of two multisets, one containing two S's and one containing
three, should be a multiset containing six S's, then see whether the product
operation f*g(x) = f(x)*g(x) has useful properties for modeling the real
world. I don't see why you would call something like that intersect, though.

As for what laws hold for multisets, you just go figure out which ones do.
Matrix multiplication is not commutative, but that doesn't make matrices
"bad"
or useless. It's just something you need to know.
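[A minimal Python sketch of the function-based definition above. Multisets are dicts mapping element to multiplicity; the function names `ms_sum`, `ms_min`, and `ms_product` are mine, chosen for illustration, not from the thread.]

```python
# Multiset over a domain U as a dict mapping element -> multiplicity > 0.
# Elements absent from the dict have multiplicity 0.

def ms_sum(f, g):
    """UNION ALL: multiplicities add."""
    return {x: f.get(x, 0) + g.get(x, 0) for x in f.keys() | g.keys()}

def ms_min(f, g):
    """SQL-style INTERSECT ALL: pointwise minimum, zeros dropped."""
    m = {x: min(f.get(x, 0), g.get(x, 0)) for x in f.keys() & g.keys()}
    return {x: k for x, k in m.items() if k > 0}

def ms_product(f, g):
    """The 'product' intersect discussed above: pointwise product."""
    m = {x: f.get(x, 0) * g.get(x, 0) for x in f.keys() & g.keys()}
    return {x: k for x, k in m.items() if k > 0}

f = {'S': 2}          # two S's
g = {'S': 3, 'T': 1}  # three S's and a T

print(ms_sum(f, g))      # {'S': 5, 'T': 1} (key order may vary)
print(ms_min(f, g))      # {'S': 2}
print(ms_product(f, g))  # {'S': 6}
```

With this encoding the "two S's intersect three S's" question is just a choice between `ms_min` (giving two) and `ms_product` (giving six).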

Steve Kass
Drew University

Bob Badour

unread,
Feb 13, 2003, 6:23:14 PM2/13/03
to
"Paul Vernon" <paul....@ukk.ibmm.comm> wrote in message
news:b2g807$su6$1...@sp15at20.hursley.ibm.com...

Paul,

I can certainly understand your criticism of ORDER BY if it were applied to
SELECT expressions nested within other operations such as INSERT INTO,
CREATE VIEW etc. -- if those are even allowed by the standard or by any
product.

I am not sure I understand your criticism of ORDER BY for presentation. In
this situation, doesn't ORDER BY apply to the operation that physically
encodes the result for transmission outside the DBMS? It seems to me the
operation in question is quite explicit even if the syntax requires no
keyword for it.

Even if one were to type-cast the result to an implicitly ordered non-scalar
type such as an array, one would still require a final operation that
physically encodes the array for transmission outside the DBMS.

To put it precisely, in both cases the DBMS starts with an internal physical
encoding of an abstract logical value and converts it to an external
physical encoding of the same value.

Why is it necessary or desirable to specify two operations instead of one?

IE. one operation that changes the type using an explicit order followed by
a second operation that physically encodes the ordered type VS. a single
operation that physically encodes the result in a specified order

Referring to the concepts in TTM, it seems like a point of design with
interesting design tradeoffs. When designing type generators for relations
and arrays, what possible representations will the generated types have?
What are the pro's and con's of choosing one set of possible representations
over another?

It sounds like you may have already put more thought into these design
tradeoffs than I have, and I am interested in your opinions.


Bob Badour

unread,
Feb 13, 2003, 6:24:15 PM2/13/03
to
"Paul Vernon" <paul....@ukk.ibmm.comm> wrote in message
news:b2g807$su6$1...@sp15at20.hursley.ibm.com...

Anton, Lauri and Paul,

Your recent exchange seems to be talking around a couple of issues, and I
just want to try to express the issues directly for clarity--more to help me
and other readers see and understand the shortcuts you may have taken in
your reasoning than for any other purpose. I realise that you and many of
the frequent contributors to this newsgroup will find it quite remedial.

Sets have no implicit order. When representing a set, one may order the
members of the set in any way without changing the thing represented. In
other words, all orders of representation are equally valid to describe a
set.

Some operations on relations require an explicit order: quota query, min,
max etc. Using physical order in the representation of relations can
sometimes speed the evaluation of unordered operations like joins.

Presentation implies some tangible physical representation, which by
definition lies outside the scope of logical data models. Specifically,
'presentation' usually means external physical representation as opposed to
internal physical representation.

Logical representations relying on an implicit order are generally
undesirable for data management. Implicit order in the logical
representation will generally require a fixed order in the physical
representation thereby disallowing many otherwise allowable storage and
manipulation options, which for some operations will include the most
efficient options. Because implicit order is irrelevant to sets, database
management systems that logically represent data as sets may use any
convenient physical order to improve performance when evaluating operations.

Logical arrays, bags, trees and lists have no more tangible existence than
relations or sets. They are all abstract. For human reception, we must
encode any of these abstractions on some physical medium for transmission:
ink on paper, fluorescing phosphorous on glass, scrapings in wax or sand,
bumps on paper, banks of lights or LEDs etc.

Even the physical devices composing a computer system must encode the
abstractions above onto some physical medium: isolated charged regions on a
semi-conductor device, magnetic alignment of small regions of ferrous
materials, holes in a piece of paper etc.

Many equivalent physical representations or encodings will exist for each of
the logical abstractions, because operations rather than physical encodings
define the abstractions. One really draws from the same universe of physical
representations when encoding any of the above logical abstractions.
However, those abstractions relying on implicit order must either draw from
a smaller subset of that universe or must extend the managed data with some
additional encoding of the implicit order.

In some cases, a particular physical encoding will have specific features or
properties that map directly to the specific operations defining a logical
abstraction. For instance, encodings of physical memory addresses as 'link
pointers' may map directly to the Next and Previous operations for a list
abstraction or physically encoding sequential elements at sequential memory
addresses with a fixed offset may map directly to array referencing
operations.

When most people discuss certain logical abstractions such as arrays or
lists, they usually assume physical encodings drawn from a very small subset
of the possible encodings--specifically they assume those physical encodings
with features that map to the most frequently used operations for the
logical abstraction. For example, most people would assume some encoding of
link pointers when talking about lists, and most people would assume some
sequential encoding--possibly at some level of indirection--when talking
about arrays.

Duplicates in a logical abstraction can only have meaning if the logical
abstraction relies on implicit order or some other implicit information.
Otherwise, users cannot reference the individual instances even to count
them. IMO, that's an important concept to understand.

Those who argue one only needs to add a COUNT operation fail to recognize
that the resulting abstraction relies on implicit order because the COUNT
operation would depend on the implicit order of the internal physical
storage.
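[A quick illustration of that last point, my sketch rather than anything from the thread: a bag can always be re-encoded as a set of (value, count) pairs, after which no implicit order or hidden row identity is needed to count duplicates.]

```python
from collections import Counter

bag = ['S', 'S', 'T']                      # a bag, written as a list (the order is incidental)
as_set = set(Counter(bag).items())         # {('S', 2), ('T', 1)} -- a true set, no duplicates
total = sum(count for _, count in as_set)  # cardinality of the original bag, recovered
print(total)  # 3
```

The set form carries exactly the same information as the bag, but every tuple is now identifiable by its value alone.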


D Guntermann

unread,
Feb 13, 2003, 6:14:54 PM2/13/03
to

"Jan Hidders" <hid...@REMOVE.THIS.ua.ac.be> wrote in message
news:3e4ad...@news.ruca.ua.ac.be...
> Lauri Pietarinen wrote:
> >
> >what is your take on Garcia-Molina, Ullman and Jennifer Widom
> >regarding their stand on duplicates?
> >
> >(see http://www.dbdebunk.com/cjddtdt.htm and cjddtdt2)
> >
[snipped for brevity]

>
> But how your algebra looks depends on how you
> answer question 2, because query optimization is the main raison d'etre of
> the algebra, and there it is a completely different story. It can for
> example be more efficient to postpone duplicate elimination. If you don't
> have a bag algebra you cannot express this in your algebra.
>

I'll try not to sound too ignorant, but I'm afraid I will anyway, as I
haven't had time to read the link to Mr. Date's comments yet.

Why does query optimization have to expose bag algebra? Can't structures be
manipulated at a lower internal representation that is independent of a
logical data model and logical manipulative aspects (e.g. relational model
with relational algebra)? Isn't this the point of ANSI-SPARC architecture?
I wonder what the ramifications are on data independence if relations,
whether base, derived, or intermediate, are allowed to incorporate bags in
order to accommodate optimization.

Regards,

- Dan

Bob Badour

unread,
Feb 13, 2003, 6:47:32 PM2/13/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:irT2a.10$O%2....@news.oracle.com...

Why should the sum of all salaries equal the sum of distinct salaries?


Mikito Harakiri

unread,
Feb 13, 2003, 7:28:15 PM2/13/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2h96p$ma2$1...@slb9.atl.mindspring.net...

> Mikito,
>
> It's pointless to talk about the limit of a set S without setting forth
> some context. If you are talking about sets, the limit of {0,x} as the
> real number x goes to zero is {0}. Why is the change of cardinality
> "nasty"? There is no mathematical reason to think that the
> cardinality function must be continuous.

Steve,

Let me try to provide an example as a context.

Consider a database of all polynomial roots. In multiset model we have

table Polynomial (
id integer,
power integer,
coeff number
)

CONSTRAINT c1 UNIQUE KEY (id, power)
CONSTRAINT c2 CHECK ( power >= 0 )

table Roots (
polynomial id number,
root number
)

In the set model, I would have to make all tuples in the Roots relation to
be distinct, and, therefore, add some extra column. When sets advocates talk
about a tuple reflecting some real-world relationship, how would they
interpret this extra column? If numbers are real, it might be the
enumeration induced by the total ordering of the roots. What if the numbers
are complex?

> You could, if you chose, define the limit of a multiset, and you would
> have a different discussion. If you defined things in the most natural
way,
> the limit of {0,x} as x goes to zero would be {0,0}. Discussing
continutity
> is pointless without context. If f is a function from the collection of
> multisets
> of reals to the collection of multisets of reals, then you need a topology
> on multisets of reals to be able to discuss continuity (relative to that
> topology). Without thinking it all through, I suspect a good topology
> is the
> smallest one containing all open intervals (a,b) (these being sub_sets_ of
> reals.

I don't follow. A set of all open intervals (a,b) defines just a standard
topology on R. How would it work for sets, or multisets of reals?

> To make the limit {0} make sense for multisets, you'd need every open set
> around {0} to contain {e, e} for some e <> 0, and probably for it to
> contain {e,e,e,e,...} as well. Maybe it has nice properties, too. But
> the {0}
> limit is exactly the correct one for sets, of course, under the usual
> topology
> on the real numbers.

Again, in which topology? We are talking about the space of all sets (or,
alternatively, multisets) of real numbers, not just the space of reals.

> Change of cardinality on projection is nothing at all
> difficult. Project a line onto a plane in the direction of the line,
> and you've
> gone from an uncountable set to a finite one. Not changing cardinality is
> more trouble.

Agreed.

> Be careful not to make claims about good and bad properties of
operations
> on multisets (or sets) without specifying the context in which you are
> making
> these statements.

Granted, but does the Relational Theory itself always provide such a
context? For example, the Alice Book defines relation in the "Named vs.
Unnamed Perspective", then in the "Conventional vs. Logic Programming
Perspective". Not exactly a precise definition. What is a relation? Is the
column ordering important? If not, is the naming function important?

Mikito Harakiri

unread,
Feb 13, 2003, 7:54:15 PM2/13/03
to
"Bob Badour" <bba...@golden.net> wrote in message
news:4uW2a.1409$0G3.16...@mantis.golden.net...

> "Mikito Harakiri" <mikha...@ywho.com> wrote in message
> news:irT2a.10$O%2....@news.oracle.com...
> > Not
> > for aggregates:
> >
> > PROJECTION*AGGREGATE != AGGREGATE*PROJECTION
> >
> > for example
> >
> > select distinct S from (
> > select distinct SUM(SAL) from emp
> > )
> >
> > is not the same as
> >
> > select distinct SUM(SAL) from (
> > select distinct SAL from emp
> > )
> >
> > where i'm using SQL with the "distinct" keyword merely as a surrogate
for
> > true relational syntax ("distinct" is redundant after aggregate
operation,
> > of course).
>
> Why should the sum of all salaries equal the sum of distinct salaries?

In set model they don't. In the bag model, however, we rewrite the above
identity without the "distinct" keyword:

select S from (
select SUM(SAL) S from emp
)

is the same as

select SUM(SAL) from (
select SAL from emp
)

Once again, this means that PROJECTION and AGGREGATE commute in the bag
model.
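[This commutation claim can be checked directly against SQL's bag semantics; a small sqlite3 sketch with an invented emp table:]

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE emp (name TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [('a', 100), ('b', 100), ('c', 200)])

# Bag model: projection keeps duplicates, so summing over the projection
# is the same as summing over the base table.
total = con.execute("SELECT SUM(sal) FROM (SELECT sal FROM emp)").fetchone()[0]
direct = con.execute("SELECT SUM(sal) FROM emp").fetchone()[0]
print(total, direct)  # 400 400

# Set-style projection (DISTINCT) eliminates the duplicate 100 first,
# so the aggregate no longer commutes with the projection.
distinct_total = con.execute(
    "SELECT SUM(sal) FROM (SELECT DISTINCT sal FROM emp)").fetchone()[0]
print(distinct_total)  # 300
```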


Bob Badour

unread,
Feb 13, 2003, 8:19:38 PM2/13/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:2%S2a.9$O%2....@news.oracle.com...

> Suppose we have a set {0,1}. Let's move the "second" set member a little
> bit: {0,1/2}. So far so good, the set still has 2 elements. Let's move it
> one more time: {0,1/4}. We still have set cardinality 2. In the limit,
> however, we have the set with only one element, {0}. This change of
> cardinality is very nasty from a math perspective, because it presents a break
> of continuity. This is why mathematicians would consider the limit to be
> the multiset {0,0} rather than just the set {0}. Mathematical definitions do
> contain the multiset concept explicitly, for example,
> "a spectrum is a *multiset* of eigenvalues". The break of continuity manifests
> itself in relational theory as a change of row-set cardinality after
> projection is applied. On the contrary, in the multiset model, projection
> doesn't change the number of rows. (Selection -- being dual to projection --
> doesn't change the number of columns in either the set or the bag model.)
>
> In short, logic and set theory together are only a tiny part of
mathematics.
> Don't evangelize them.

Since this newsgroup devotes itself to database theory, I am curious where
you believe a similar application of limits at infinity might cause a
problem for database management.

As for projection, when I place my hands together aligning finger to finger
and thumb to thumb then hold them between the lightbulb and the wall, I see
only one hand in the projected shadow. Similarly, projecting a finite set of
points from the surface of a sphere onto a plane will yield points within a
circle. Two points on the surface of the sphere will very often project onto
a single point within the circle.

I suppose we could define the plane of projection as a multiset of points
and then the projection of the circle onto the plane could have a multiset
of points. I am not aware of any practical problems in geometry requiring
this multiset approach--granted my exposure to geometrical problems is
probably quite limited.

I have a question though: How many of each point should the plane of
projection have in its multiset? It's clear two of each would suffice for
the sphere/circle example given above. But does it suffice for a general
definition of projection? What if we project a perpendicular line onto the
plane? Then we need an infinite number of the point of intersection. Does
this mean that the plane of projection must always have an infinite number
of every point? It seems a little extreme.

How would the multiset plane with an infinite number of each point differ
from a set plane with one of each point? When does this have a practical
application?

It seems to me that most folks are quite satisfied to accept that two points
from a sphere can project onto one point within a circle without worrying
about a break in continuity. Nobody worries that one cannot reconstruct the
sphere from the projected set of points. It seems to me that one cannot
reconstruct the sphere from the projected multiset of points either.

In your example of the multiset of {0,0}, the physical location of the
zeroes becomes important for determining the cardinality. Even assuming that
{0,0,1} and {0,1,0} are equivalent multisets, the cardinality depends on the
existence of a zero in two different physical locations in the
representation. Physical independence has long been recognized to have many
benefits for database management.

To regain physical independence, I suppose we could change our
representation to not rely on location by including the cardinality of each
distinct value in the representation: {0x1,1x1}->{0x1,1/2x1}->...->{0x2}

If we determine the cardinality of the multiset by adding the cardinalities
of each distinct value, we see that the cardinality remains constant at 2.

Haven't we used an extended form of projection to transform the
representation? ie. projection extended with the COUNT aggregate operation
or what Codd once called the degree of duplication operation. Going from the
representation you originally used to the latter logically equivalent
representation, hasn't the number of times we represented 0 changed from 2
to 1?

Certainly, we have many practical applications for a project operation that
alters cardinality. If we extend the project operation into some kind of
summarize operation with aggregate functions--as is universally accepted
practice for dbmses--I see no need for multisets nor for a project operation
that preserves cardinality and results in multisets. We have a logically
equivalent mapping of any resulting multiset onto a set, and we can just use
the set instead.

Granted, we could choose to have multisets and a cardinality preserving
project operation regardless of need, but in doing so, we would necessarily
give up logical identity and physical independence. I think you will have a
difficult time providing a logical justification for the loss.


Bob Badour

unread,
Feb 13, 2003, 8:48:53 PM2/13/03
to

"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:XPW2a.15$O%2....@news.oracle.com...

root number,
degree integer
)

CONSTRAINT c3 UNIQUE KEY (polynomial id, root)
CONSTRAINT c4 CHECK ( degree > 0 )
CONSTRAINT c5 CHECK P in Polynomial (
P.power = sum(R.degree | { R in Roots, P.id = R.polynomial id})
)

I would interpret 'degree' as the number of coincident roots at 'root', and
my interpretation does not depend on the specific set of values in the
'number' type.
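[A sketch of this set-model schema in sqlite3, with column names slightly adapted. The sum-of-degrees constraint c5 is beyond a simple CHECK in most engines and would need a trigger or an application-level check, so it is only asserted here, not enforced by the schema.]

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("""
    CREATE TABLE Roots (
        polynomial_id INTEGER,
        root          REAL,
        degree        INTEGER CHECK (degree > 0),  -- c4
        UNIQUE (polynomial_id, root)               -- c3
    )""")

# (x - 1)^2 = x^2 - 2x + 1 has a double root at x = 1: one tuple with
# degree 2, instead of two indistinguishable (1, 1.0) rows in the bag model.
con.execute("INSERT INTO Roots VALUES (1, 1.0, 2)")

total_degree = con.execute(
    "SELECT SUM(degree) FROM Roots WHERE polynomial_id = 1").fetchone()[0]
print(total_degree)  # 2 -- matches the polynomial's power, as c5 requires
```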


Mikito Harakiri

unread,
Feb 13, 2003, 8:57:14 PM2/13/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2h9jb$p0h$1...@slb4.atl.mindspring.net...

> I didn't read the whole thread, but in the right context, the
> bags are functions, and intersect, union, etc. are all well-defined.

I don't agree that the matter is "context dependent". One either has a model
with a consistent set of operations that users like, or not.

> Definition: A multiset M over a domain U is a function with
> domain U and range the set of non-negative integers.
>
> We say that the element x is in M if M(x) > 0.
> We say that the element x appears in M with multiplicity k, if M(x) = k
> We say that M = M' for two multisets M and M' on domains U and U'
> respectively if M(x) = M'(x) for all x in the intersection of U and U' and
> M(x) = 0 when x is not in U, and M'(x) = 0 when x is not in U'.
>
> Functions are well-defined in terms of sets also, subsets of U x Z where
> (x,n) and (x,m) both in a function f implies n = m (single-valuedness).
>
> It makes no sense to take the union of two functions, but it does make
sense
> to take the sum (union all), min (intersection), and other operations.

Union is redundant in multiset model anyway. It is a distinct "union all".
"Distinct" is a grouped aggregate.

> Just define things carefully and it's all much easier. If you think the
> intersection
> of two multisets, one containing two S's and one containing 3 should be a
> multiset containing 6 S's, then see if it the product operation f*g(x) =
> f(x)*g(x)
> has useful properties for modeling the real world.

With my definition

select name from e1
intersect
select name from e2

is identical to

select e1.name from e1, e2
where e1.name = e2.name

This is not true with the "min" definition.
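[The difference between the two intersect definitions is easy to see against standard SQL, where INTERSECT is set-valued. A sqlite3 sketch with invented data, two S's in e1 and three in e2:]

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE e1 (name TEXT)")
con.execute("CREATE TABLE e2 (name TEXT)")
con.executemany("INSERT INTO e1 VALUES (?)", [('S',), ('S',)])          # two S's
con.executemany("INSERT INTO e2 VALUES (?)", [('S',), ('S',), ('S',)])  # three S's

# SQL's INTERSECT removes duplicates: a single S survives.
inter_rows = con.execute(
    "SELECT name FROM e1 INTERSECT SELECT name FROM e2").fetchall()
print(len(inter_rows))  # 1

# The join keeps bag multiplicities: 2 * 3 = 6 copies of S, which is
# exactly the 'product' definition of intersect.
join_rows = con.execute(
    "SELECT e1.name FROM e1, e2 WHERE e1.name = e2.name").fetchall()
print(len(join_rows))  # 6
```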

> I don't see why you
> would
> call something like that intersect, though.

See the above.

> As for what laws hold for multisets, you just go figure out which ones do.

Found one that doesn't hold for multisets:-(

select e1.name from (
select e1.name, e2.name from emp e1, emp e2
)

is not the same as

select name from emp

More concisely:

PROJECT*POWER != 1
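[The same failure can be reproduced with legal SQL by projecting the self cross product directly; a sqlite3 sketch with invented data:]

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE emp (name TEXT)")
con.executemany("INSERT INTO emp VALUES (?)", [('a',), ('b',), ('c',)])

# Project the self cross product on name: in bag semantics each of the
# 3 names comes back 3 times, once per row of the second copy of emp ...
crossed = con.execute("SELECT e1.name FROM emp e1, emp e2").fetchall()
print(len(crossed))  # 9

# ... which is not the same bag as emp itself.
plain = con.execute("SELECT name FROM emp").fetchall()
print(len(plain))  # 3
```

In the set model the projection of the cross product would collapse back to the three distinct names, and the identity would hold.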


Bob Badour

unread,
Feb 13, 2003, 9:18:42 PM2/13/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:jcX2a.16$O%2....@news.oracle.com...

With all due respect, Mikito,

The differences above are not about the model. They are about the questions
asked. In your latter example, you are comparing the sum of all salaries
with the sum of all salaries. In your former example, you compared the sum
of all salaries with the sum of distinct salaries.

The sensible optimization for the first query in each comparison is to
simply drop the outer select operation as a no-op.

select S from (
select SUM(SAL) S from emp
)

is logically equivalent to

select SUM(SAL) S from emp

regardless whether we are dealing with sets or multisets.

I wonder why someone would request:

select SUM(SAL) from (
select SAL from emp
)

when one wants the sum of all salaries. It seems like a lot of trouble to go
to when one could just as easily write:

select SUM(SAL) from emp

which more naturally reflects a request for the sum of all employee
salaries.

Of course, in a set based system, distinct wouldn't even be a keyword and
everyone would know that:

select SAL from emp

means a relation with the distinct salaries in emp. Users would immediately
realise that:

select SUM(SAL) from (
select SAL from emp
)

means the sum of the distinct salaries in emp.

You might suggest that the inner query in each case above is a view. In that
case, I would point out that most views are created by advanced users who
should know better than to create a view of distinct salaries when users
need a view of all salaries and vice versa--and the advanced users will
generally know how to write either view.


Mikito Harakiri

unread,
Feb 13, 2003, 9:45:21 PM2/13/03
to

"Bob Badour" <bba...@golden.net> wrote in message
news:RHY2a.1418$0T3.16...@mantis.golden.net...

> I wonder why someone would request:
>
> select SUM(SAL) from (
> select SAL from emp
> )
>
> when one wants the sum of all salaries.

<user's logic>
I want the sum of all salaries. There are a lot of confusing columns in the
Emp view, but I think that all the columns other than Sal are not important
for the question I'm asking. Therefore, let's project them away. Now I'm
adding up all the salaries. Why the hell am I short of $100K?
</user's logic>


Bob Badour

unread,
Feb 14, 2003, 12:00:33 AM2/14/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:tQY2a.19$O%2....@news.oracle.com...

Frankly, I think you are stretching your user's logic beyond the credible.
All those confusing columns are irrelevant to

select SUM(SAL) from emp

which is almost as simple from the user's perspective as is

select SAL from emp

Your example assumes your users lack sufficient knowledge and training to
formulate even the simplest of queries. If that is the case, they are as
likely to get incorrect or incomprehensible results with a multiset dbms as
with a relational dbms.


Steve Kass

unread,
Feb 14, 2003, 12:13:14 AM2/14/03
to
Bob,

Thanks - that's exactly the way to model this, and in fact, the two models
reflect the usual mathematical definition of a multiset as a function (a set
of ordered pairs), with the convenience of omitting rows where the function
value is zero.

Steve

Steve Kass

unread,
Feb 14, 2003, 12:31:35 AM2/14/03
to

Mikito Harakiri wrote:

>"Steve Kass" <sk...@drew.edu> wrote in message
>news:b2h9jb$p0h$1...@slb4.atl.mindspring.net...
>
>
>>I didn't read the whole thread, but in the right context, the
>>bags are functions, and intersect, union, etc. are all well-defined.
>>
>>
>
>I don't agree that the matter is "context dependent". One either has a model
>with a consistent set of operations that users like, or not.
>
>

Whether "users like" something has nothing to do with its mathematical
consistency.

I didn't say it was. I said that "intersect" as you define it
is given by the product of the functions. "intersect" the way SQL
defines it is given by the minimum of the functions. They are both
valid definitions, and if one is better for you, create a language
for multiset manipulations that uses the word intersect for that one.

Because of its use for sets, intersect (I think) gives the impression
of, well, intersection, not product. I just don't see why you'd call
the operation defined by f op g (x) = f (x) * g(x) "intersect" instead
of product.

>
>
>>I don't see why you
>>would
>>call something like that intersect, though.
>>
>>
>
>See the above.
>
>
>
>>As for what laws hold for multisets, you just go figure out which ones do.
>>
>>
>
>Found one that doesn't hold for multisets:-(
>
>select e1.name from (
> select e1.name, e2.name from emp e1, emp e2
>)
>
>is not the same as
>
>select name from emp
>
>More concisely:
>
>PROJECT*POWER != 1
>
>
>

Right. But there's no problem here, just as there's no problem
with the fact that matrix multiplication is not commutative, even
though matrices can be considered a generalization of a set where
the multiplication is commutative.

For the record, the inverse operation of what you are calling power
is relational division.

Let emp be a multiset. Let emp X emp be the cartesian product of
two copies of emp. Then (emp X emp) / emp is emp, if we read /
to mean relational division. The mathematical analogy is the factor
operation where you factor a group by a normal subgroup to obtain
cosets.

Steve


Steve Kass

unread,
Feb 14, 2003, 12:49:10 AM2/14/03
to

Mikito Harakiri wrote:

>
> There is a difference between "can" and "have to". Reductionism to
> sets might be unnatural, impractical, etc. The famous example is the
> ordered pair definition
>
> (a, b) := {{a}, {a, b}}
>
> AFAIK, no mathematical theorem is reformulated to use the above set
> based definition instead of ordered pairs. Ordered tuple is widely
> considered to be a basic concept (including Relational Theory;-).

That's why Lauri quoted a statement that said "can", not "have to".

I don't think mathematicians usually take any new notation as basic and
not needing a definition as a set until they make sure there is a
consistent set-theoretic definition. Once that is done (once), there
may never be a need to resort to manipulating the underlying sets, as is
typically the case with ordered pairs.

If you haven't read it, read Don Knuth's book "Surreal Numbers," which I
think makes a nice case for never forgetting the connections to set theory.

Steve

Paul

unread,
Feb 14, 2003, 4:31:03 AM2/14/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message news:<m7Y2a.18$O%>
> "Steve Kass" <sk...@drew.edu> wrote in message

> > I didn't read the whole thread, but in the right context, the
> > bags are functions, and intersect, union, etc. are all well-defined.
>
> I don't agree that the matter is "context dependent". One either has a model
> with a consistent set of operations that users like, or not.

What's the interpretation given to a table with duplicate rows?

In relational theory, each relation corresponds to a logical predicate
e.g. the employee table has a truth-valued function f(emp, dept,
salary) say.
and each row in the relation corresponds to the proposition obtained
by substituting actual values for the placeholders emp, dept, salary.
e.g. f("J Smith", 10, 20,000) means J Smith is in dept 10 and has a
salary of 20,000.

So with this interpretation duplicate rows don't make sense because
you're just making the same assertion more than once, which doesn't
make it any more true.
So is "multi-set" relational algebra not based on predicate logic?

Also, there's no way for users to identify a row because the thing
that differentiates them is held internally and is thus invisible to
the user - and this breaks the logical/physical distinction.

You can always get the same duplicate row functionality in true
relational algebra by adding another column which contains the "count"
for the rest of the attributes.

If multi-set algebra is used maybe it should always be internal (i.e.
part of the physical implementation) and hidden from the user?

Paul.

Paul

unread,
Feb 14, 2003, 7:07:36 AM2/14/03
to
"Bob Badour" <bba...@golden.net> wrote in message news:<g8W2a.1405$Ay3.16...@mantis.golden.net>...

> Some operations on relations require an explicit order: quota query, min,
> max etc.

Aren't there two separate issues here?
1) order on domains
2) order of tuples within a relation

The first would be generally desirable (not sure about things like
image-valued domains though) and is needed for the MIN & MAX aggregate
operations.

The second isn't really part of relational theory but appears to be
needed for quota queries (I assume this means things like "SELECT TOP 5 *
FROM emp ORDER BY salary"). Although thinking about it, this ORDER BY
is really different from the standard "SELECT * FROM emp ORDER BY
salary":

The TOP 5 query still returns an unordered set in theory: although you
get the top 5 salaries it wouldn't matter in what order those 5 were
returned. In fact you can do these TOP-style queries using COUNT and
subqueries I think, which is just about domain ordering. So really
this ORDER BY is syntactic shorthand, which also deals with problems
like ties etc.

The "SELECT * FROM emp ORDER BY salary" is different because it
implies an order to the *relation* not just the domains of its
attributes.

I suppose that in theory a query language could do without an ORDER BY
and have any ordering done in the client application. But for physical
reasons it would be more efficient for the DBMS to choose an
optimization method that would output the rows in a specified order.

Paul.

Paul Vernon

unread,
Feb 14, 2003, 8:57:38 AM2/14/03
to
"Bob Badour" <bba...@golden.net> wrote in message
news:l7W2a.1404$7C3.16...@mantis.golden.net...

> Paul,
>
> I can certainly understand your criticism of ORDER BY if it were applied to
> SELECT expressions nested within other operations such as INSERT INTO,
> CREATE VIEW etc. -- if those are even allowed by the standard or by any
> product.
>
> I am not sure I understand your criticism of ORDER BY for presentation. In
> this situation, doesn't ORDER BY apply to the operation that physically
> encodes the result for transmission outside the DBMS? It seems to me the
> operation in question is quite explicit even if the syntax requires no
> keyword for it.

Maybe it is a matter of taste, but I find it highly non-explicit that

select distinct SAL from emp

returns a Relation, but

select distinct SAL from emp order by SAL

returns something else.


> Even if one were to type-cast the result to an implicitly ordered non-scalar
> type such as an array, one would still require a final operation that
> physically encodes the array for transmission outside the DBMS.
>
> To put it precisely, in both cases the DBMS starts with an internal physical
> encoding of an abstract logical value and converts it to an external
> physical encoding of the same value.
>
> Why is it necessary or desirable to specify two operations instead of one?

Necessary for good language design and conceptual clarity, I would say.
Implicit casting is not a very good idea at the best of times, and certainly
not for something as big as a change from one non-scalar type to another.

> IE. one operation that changes the type using an explicit order followed by
> a second operation that physically encodes the ordered type VS. a single
> operation that physically encodes the result in a specified order

CAST_AS_ARRAY(select SUM(SAL), DEPT from emp)
Y_ORDERED_BY SAL

would be logically more correct than the SQL syntax, but finding a nice syntax
(which the above certainly is not) is another matter.

> Referring to the concepts in TTM, it seems like a point of design with
> interesting design tradeoffs. When designing type generators for relations
> and arrays, what possible representations will the generated types have?
> What are the pro's and con's of choosing one set of possible representations
> over another?

Poss reps are really only for scalar types. According to Celko there
are many ways to represent a tree, but I can't see that being applicable to
relations.

> It sounds like you may have already put more thought into these design
> tradeoffs than I have, and I am interested in your opinions.

Possibly. It's certainly an interesting area when one considers 'multiple
selection' and then a mapping of databases and database subsets to other
non-scalar variables. For example, what is the non-scalar type that is
represented by, say, an OLAP reporting tool or maybe an email application? How
much independence is there between the graphical display of information and
non-scalar types? My suspicion is that jumping straight from relations to
graphical displays is too much of a leap, and what is required is possibly a
number of intermediate non-scalar types that allow for data display
specification and more limited but more intuitive data manipulation options.

Jan Hidders

unread,
Feb 14, 2003, 2:06:28 PM2/14/03
to
Lauri Pietarinen <lauri.pi...@atbusiness.com> wrote in message news:<3E4BD7B1...@atbusiness.com>...
> >
> >>and better optimisations would be obtained without duplicates"
> >
> >No. In fact, in theory, all optimizations that can be done in a set-based
> >algebra can also be done in a bag-based algebra but not the other way
> >around.
>
> So you in fact disagree with Date on that one?

Well, let's say that in his "war on duplicates" he sometimes IMO tends
to oversimplify things.

> Is not optimisation (at least partly) a question of query transformations?

Yes.

> I understand that more transformations are available when we operate with
> sets that if we operate with bags. Even the excerpt from the book
> seems to suggest this:
>
> <quote>


> For instance, you may have learned set-theoretic laws such as A
> INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is
> formally the "distributive law of intersection over union." This law
> holds for sets, but not for bags.

> <quote/>

As Paul Vernon correctly remarked a bag algebra will always be a
superset of a set algebra in the sense that for all expressions in the
set algebra there is an equivalent expression the bag algebra. As a
consequence every algebraic rule in the set algebra has a
corresponding rule in the bag algebra. For example, for the rule above
you would have in the bag algebra (here SET is the unary operation
that removes duplicates):

SET(SET(A) INTERSECT SET(B UNION C)) =
SET((SET(A) INTERSECT SET(B)) UNION (SET(A) INTERSECT SET(C)))

So, although things get a bit more complicated, there is absolutely no
reason why a query optimizer that is based on a bag algebra would
perform worse than one that is based on a set algebra.
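The point can be checked mechanically with Python's `Counter` standing in for bags (a sketch: `+` plays the role of additive bag UNION, `&` of minimum-based bag INTERSECT, and `SET` is the duplicate-elimination operation):

```python
from collections import Counter

def SET(bag):
    """Duplicate elimination: every positive multiplicity becomes 1."""
    return Counter({x: 1 for x in bag if bag[x] > 0})

A, B, C = Counter("x"), Counter("x"), Counter("x")

# The distributive law fails for bags:
lhs_bag = A & (B + C)            # multiplicity min(1, 2) = 1
rhs_bag = (A & B) + (A & C)      # multiplicity 1 + 1 = 2

# But the SET-wrapped forms of both sides agree, mirroring the rule:
lhs = SET(SET(A) & SET(B + C))
rhs = SET((SET(A) & SET(B)) + (SET(A) & SET(C)))
```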

-- Jan Hidders

Mikito Harakiri

unread,
Feb 14, 2003, 2:38:35 PM2/14/03
to
"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
news:51d64140.03021...@posting.google.com...

> What's the interpretation given to a table with duplicate rows?
>
> In relational theory, each relation corresponds to a logical predicate
> e.g. the employee table has a truth-valued function f(emp, dept,
> salary) say.
> and each row in the relation corresponds to the proposition obtained
> by substituting actual values for the placeholders emp, dept, salary.
> e.g. f("J Smith", 10, 20,000) means J Smith is in dept 10 and has a
> salary of 20,000.

Interpretation is not always the driving force of the theory. A widely
successful theory can have a poor interpretation. Quantum Mechanics' Copenhagen
Interpretation, for example, has some conflicts with "common sense".

> So with this interpretation duplicate rows don't make sense because
> you're just making the same assertion more then once, which doesn't
> make it any more true.
> So is "multi-set" relational algebra not based on predicate logic?

Correct.

> Also, there's no way for users to identify a row because the thing
> that differentiates them is held internally and is thus invisible to
> the user - and this breaks the logical/physical distinction.

This is a subtle issue. For example, the number 12 is 1*10^1 + 2*10^0. Any
numerical value can be thought of as an aggregate of its base coefficients. We
could even store it in the database in dismembered state

table Item (
name string,
count_in_1s integer,
count_in_10s integer,
...
)

(but for practical reasons we don't). Is this representation breaking
logical/physical distinction?

Now, in the base-1 system the same number is 1+1+1+1+1+1+1+1+1+1+1+1. Multiset
representation is just dismembering numbers in the base-1 system and pivoting
the result so that we have multiple rows instead of multiple columns.
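The arithmetic analogy can be made concrete (a small sketch; both function names are invented for illustration):

```python
# The number 12 stored "dismembered" into base-10 digit counts,
# versus the base-1 view, where the same value is a bag of units.
def digits_base10(n):
    # e.g. 12 -> {'count_in_1s': 2, 'count_in_10s': 1}
    out, place = {}, 1
    while n:
        out[f"count_in_{place}s"] = n % 10
        n //= 10
        place *= 10
    return out

def base1(n):
    # base-1: n copies of the unit symbol, i.e. a multiset of 1s
    return [1] * n
```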

> You can always get the same duplicate row functionality in true
> relational algebra by adding another column which contains the "count"
> for the rest of the attributes.

Adding a count column is certainly possible, as Bob demonstrated yet one
more time. I'm just checking whether this practice is inferior to explicitly
having a multiset concept in the model.
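A minimal sketch of that count-column encoding, using Python's `Counter` (the toy rows are invented for illustration):

```python
from collections import Counter

# A bag of rows...
bag = [("J Smith", 10), ("J Smith", 10), ("A Jones", 20)]

# ...and its duplicate-free relational encoding with an explicit
# count attribute appended to each distinct row.
relation = {row + (n,) for row, n in Counter(bag).items()}

# The original bag is recoverable, so no information is lost:
rebuilt = [row[:-1] for row in relation for _ in range(row[-1])]
```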

> If multi-set algebra is used maybe it should always be internal (i.e.
> part of the physical implementation) and hidden from the user?

What is this logical/physical distinction? I want a declarative
computational model that supports aggregation, ordering and transitive
closure in addition to the other usual things: projection, restriction,
cartesian product, etc. It would allow me much more than current Relational
Model offers.

I'm not claiming that multisets are superior to sets. There seems to be some
some cases where they do, and I want to see the whole picture.

Given that multiset is essentially a mapping of a rows to nonnegative
integers, wouldn't it be more advantageous to generalize it to mapping of a
rows to reals?


Jan Hidders

unread,
Feb 14, 2003, 3:25:32 PM2/14/03
to
"Paul Vernon" <paul....@ukk.ibmm.comm> wrote in message news:<b2gl27$nds$1...@sp15at20.hursley.ibm.com>...
> "Jan Hidders" <jan.h...@REMOVE.THIS.ua.ac.be> wrote in message
> news:3e4bc...@news.ruca.ua.ac.be...

> >
> > No. In fact, in theory, all optimizations that can be done in a set-based
> > algebra can also be done in a bag-based algebra but not the other way
> > around.
>
> Obviously, I guess. A bag algebra being a superset of a set alegbra.

Exactly. (Although the term "superset" is strictly speaking not
correct because it is not necessarily so that the operators of the bag
algebra are a superset of those of the set algebra.) My compliments
for your insight. :-)

> However it does not follow that a dbms where users were exposed to a bag-based
> algebra would be overall more efficient than one with users 'restricted' to a
> set-based alegbra. Not by a long shot.

Yes, that by itself is not a sufficient argument. However, combined
with the fact that you sometimes want internally a bag algebra anyway,
and that if the user wants to do bag-like things these will be harder
to recognize as such by the query optimizer and therefore also harder
to match with the bag operations of its internal algebra, then this is
suddenly not so clear anymore.

-- Jan Hidders

Mikito Harakiri

unread,
Feb 14, 2003, 4:05:21 PM2/14/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2husl$hj7$1...@slb3.atl.mindspring.net...

> >I don't agree that the matter is "context dependent". One either has a
model
> >with a consistent set of operations that users like, or not.
> >
> Whether "users like" something has nothing to do with its mathematical
> consistency.

But a successful theory is the one that is adopted by the users:-)

> Because of its use for sets, intersect (I think) gives the impression
> of, well, intersection, not product. I just don't see why you'd call
> the operation defined by f op g (x) = f (x) * g(x) "intersect" instead
> of product.


If "*" operation works like this

0*0=0
0*1=0
1*1=1

then it can be called intersection, even though it gives

2*3=6

Intersection is a product!
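The claim is that on characteristic functions (multiplicities restricted to 0 and 1, i.e. sets) the product coincides with intersection; only on general multiplicities does it diverge from the usual min-based bag intersection. A sketch with Python's `Counter` (the helper name is invented):

```python
from collections import Counter

def product_intersect(f, g):
    # f op g (x) = f(x) * g(x), the operation under discussion
    return Counter({x: f[x] * g[x]
                    for x in set(f) | set(g) if f[x] * g[x]})

# On 0/1 multiplicities the product IS set intersection:
A, B = Counter({"p": 1, "q": 1}), Counter({"q": 1, "r": 1})
set_case = product_intersect(A, B)           # just {"q"}

# On general multiplicities it differs from min-based intersection:
C, D = Counter({"x": 2}), Counter({"x": 3})
bag_case = product_intersect(C, D)           # 2*3 = 6
min_case = C & D                             # min(2, 3) = 2
```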

> >More concisely:
> >
> >PROJECT*POWER != 1
> >
> Right. But there's no problem here, just as there's no problem
> with the fact that matrix multiplication is not commutative, even
> though matrices can be considered a generalization of a set where
> the multiplication is commutative.

I don't see how matrices can be viewed as a generalization of sets. Matrices
are Linear Transformations, not sets.

On the other hand, you may be right. Projection is defined as an operation
that obeys the following law:

PROJECT * PROJECT = PROJECT

There is no conflict between this and the above PROJECT*POWER != 1.

In general, the whole issue of sets vs. bags might not be important. What
if we dismiss tuples altogether? Math frequently takes an operational
approach, where we are agnostic about the nature of the space and are only
concerned with space transformations. "Tell me what operations you have and
what laws they obey, and I might cook up some interpretation for you".


Mikito Harakiri

unread,
Feb 14, 2003, 5:55:21 PM2/14/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:JXc3a.12$yd...@news.oracle.com...

> In general, the whole issue about sets vs. bags might not be important. What
> if we dismiss tuples altogether? Math frequently takes an operational
> approach, when we are agnostic about the nature of the space and are only
> concerned with space transformations.

Interestingly, the situation for rows (sorry, tuples ;-) resembles the
column affair. Do we allow row duplicates? How do we distinguish them, then?
How do we distinguish columns? Is column ordering important? Is row ordering
important?


Jan Hidders

unread,
Feb 14, 2003, 7:45:55 PM2/14/03
to
D Guntermann wrote:
>
>"Jan Hidders" <hid...@REMOVE.THIS.ua.ac.be> wrote in message
>news:3e4ad...@news.ruca.ua.ac.be...
>> Lauri Pietarinen wrote:
>> >
>> >what is your take on Garcia-Molina, Ullman and Jennifer Widom
>> >regarding their stand on duplicates?
>> >
>> >(see http://www.dbdebunk.com/cjddtdt.htm and cjddtdt2)
>> >
>[snipped for brevity]
>>
>> But how your algebra looks depends on how you
>> answer question 2, because query optimization is the main raison d'etre of
>> the algebra, and there it is a completely different story. It can for
>> example be more efficient to postpone duplicate elimination. If you don't
>> have a bag algebra you cannot express this in your algebra.
>
>I'll try not to sound too ignorant, but I'm afraid I will anyway, as I
>haven't had time to read the link to Mr. Date's comments yet.
>
>Why does query optimization have to expose bag algebra?

I don't think I said it has to.

>Can't structures be manipulated at a lower internal representation that is
>independent of a logical data model and logical manipulative aspects (e.g.
>relational model with relational algebra)? Isn't this the point of
>ANSI-SPARC architecture?

Yes, it can, and yes it is.

>I wonder what the ramifications are on data independence if relations,
>whether base, derived, or intermediate, are allowed to incorporate bags in
>order to accomodate optimization.

With "intermediate results" I meant the intermediate results of the steps in
the query evaluation plan. These are usually not seen by the user.

-- Jan Hidders

Steve Kass

unread,
Feb 14, 2003, 9:10:10 PM2/14/03
to
>>Right. But there's no problem here, just as there's no problem
>>with the fact that matrix multiplication is not commutative, even
>>though matrices can be considered a generalization of a set where
>>the multiplication is commutative.
>>
>>
>
>I don't see how matrices can be viewed as generalization of sets. Matrices
>are Linear Transformations, not sets.
>
>
My typo - I meant to say that matrices can be viewed as generalizations
of numbers, not of sets.

SK



Steve Kass

unread,
Feb 14, 2003, 9:15:45 PM2/14/03
to
Paul,

There's a distinction here that you may be missing. Let's name the
two queries

select distinct SAL from emp -- query A
select distinct SAL from emp order by SAL -- query B

Perhaps unwisely, SQL allows a query to specify a set (of rows) as well
as to specify a result set (presentation of a set of rows).

Query A and Query B both specify a result set, although the ordering of
the rows in the result is only specified in query B. Only Query A specifies
a set, however, since ORDER BY is not a valid clause in a set
specification -
only in a result set specification. Some products might allow Query B to
be used as a set specification, but, for example, SQL Server only allows it
with its proprietary TOP feature, in which case the ORDER BY becomes
a necessary part of the definition of the set of rows, and does not have
anything to do with an ordering within the set.

Steve Kass
Drew University

Daniel S. Guntermann

unread,
Feb 15, 2003, 1:27:40 AM2/15/03
to
Ok. Thank you for the clarification. I realize the whole point of the
discussion is that SQL does allow for bags in base and derived/intermediate
tables... It probably is a very good thing to understand the side effects
that occur with bag algebra, since we probably use it more than we know,
though I personally try to avoid it. As others have stated, that is
probably why the Stanford trio address the issue explicitly.

- Dan

"Jan Hidders" <jan.h...@REMOVE.THIS.ua.ac.be> wrote in message

news:3e4d8...@news.ruca.ua.ac.be...

Lauri Pietarinen

unread,
Feb 15, 2003, 2:14:05 AM2/15/03
to
> Some operations on relations require an explicit order: quota query, min,
> max etc. Using physical order in the representation of relations can
> sometimes speed the evaluation of unordered operations like joins.

I agree with this posting on most accounts, but I would like to point out
that min and max do not need explicit ordering of rows, just
an ordering of the elements of the domain, e.g.

select distinct salary
from personnel p
where not exists
    ( select *
      from personnel p2
      where p2.salary > p.salary )

Obviously the DBMS would not convert max
to such an expression, but just order the rows or
use an index.
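The query above runs essentially as-is on SQLite (a sketch; the `personnel` contents are invented, with a tie at the top to show the DISTINCT matters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO personnel VALUES (?, ?)",
                 [("a", 100), ("b", 200), ("c", 200)])

# Max salary as pure restriction/projection: no row ordering involved,
# only the ordering on the salary domain via ">".
rows = conn.execute("""
    SELECT DISTINCT salary
    FROM personnel p
    WHERE NOT EXISTS
      (SELECT * FROM personnel p2 WHERE p2.salary > p.salary)
""").fetchall()
```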

I would be glad if somebody could tell me how counting
and quota queries are expressed in "pure algebra" or
do we need some extensions?

regards,
Lauri Pietarinen

Jan Hidders

unread,
Feb 15, 2003, 4:13:53 AM2/15/03
to
Lauri Pietarinen wrote:
>
>I would be glad if somebody could tell me how counting
>and quota queries are expressed in "pure algebra" or
>do we need some extensions?

You need an extension. An important paper on that is

A. Klug. Equivalence of Relational Algebra and Relational Calculus Query
Languages Having Aggregate Functions. Journal of the ACM, 29(3):699--717,
1982.

Unfortunately it is not on-line, but a paper by Leonid Libkin (the guy is a
phenomenon) on the expressive power of SQL also explains it (and a lot
more):

http://citeseer.nj.nec.com/407307.html

Section 3 page 7, 8 and 9. Normally you could ask me if you have any
questions about this paper, but I will be on a holiday in the next 7 days,
so my response may be a little slow. :-)

-- Jan Hidders

Lauri Pietarinen

unread,
Feb 15, 2003, 6:19:00 AM2/15/03
to
>
>
> SET(SET(A) INTERSECT SET(B UNION C)) =
> SET((SET(A) INTERSECT SET(B)) UNION (SET(A) INTERSECT SET(C)))
>
>So, although things get a bit more complicated, there is absolutely no
>reason why a query optimizer that is based on a bag algebra would
>perform worse than one that is based on a set algebra.
>

So the point here seems to be that in SQL we can
always use 'DISTINCT' anyway and the optimizer will be able to take
that into account and optimize accordingly?

e.g.

select distinct s.s#
from s, sp
where s.s# = sp.s#

could be converted into

select s.s#
from s
where exists
    ( select *
      from sp
      where s.s# = sp.s# ) ?
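The equivalence of the two formulations can be checked on SQLite (a sketch with invented supplier/shipment data; the column is renamed `sno` because `#` is not a legal identifier character there):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s  (sno TEXT);
    CREATE TABLE sp (sno TEXT, pno TEXT);
    INSERT INTO s  VALUES ('S1'), ('S2'), ('S3');
    INSERT INTO sp VALUES ('S1','P1'), ('S1','P2'), ('S2','P1');
""")

# The DISTINCT join and the EXISTS rewrite denote the same relation:
q1 = conn.execute("""
    SELECT DISTINCT s.sno FROM s, sp WHERE s.sno = sp.sno
""").fetchall()
q2 = conn.execute("""
    SELECT s.sno FROM s
    WHERE EXISTS (SELECT * FROM sp WHERE s.sno = sp.sno)
""").fetchall()
```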

In principle this is correct, but:

- more work for the optimizer groups
  because it has to support two "modes";
  hence bigger and buggier products.

- optimizing efforts will be concentrated on
  the most used features, that being queries
  without distinct.

I would draw a parallel with using GOTO's
in programming languages:

- GOTO's give programmers more freedom to do what
they want

- however, the programs are harder to optimize, so programs will
end up being bigger and running slower

- compilers will be harder to write

- the programs will be probably be buggier

- and as Dijkstra has pointed out, everything can be done
_without_ GOTO's anyway.


My point here being that by providing users with both
tuple bags and relations (by use of distinct) we are actually
causing more confusion for all!

regards,
Lauri Pietarinen

Paul Vernon

unread,
Feb 15, 2003, 7:29:32 AM2/15/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2k7pe$57v$1...@slb0.atl.mindspring.net...

> Paul,
>
> There's a distinction here that you may be missing. Let's name the
> two queries
>
> select distinct SAL from emp -- query A
> select distinct SAL from emp order by SAL -- query B
>
> Perhaps unwisely, SQL allows a query to specify a set (of rows) as well
> as to specify a result set (presentation of a set of rows).

Steve thanks for highlighting the distinction. I think the fact that I sort of
missed it is evidence of the unwiseness of the SQL approach.

> Query A and Query B both specify a result set, although the ordering of
> the rows in the result is only specified in query B. Only Query A specifies
> a set, however, since ORDER BY is not a valid clause in a set
> specification -
> only in a result set specification. Some products might allow Query B to
> be used as a set specification, but, for example, SQL Server only allows it
> with its proprietary TOP feature, in which case the ORDER BY becomes
> a necessary part of the definition of the set of rows, and does not have
> anything to do with an ordering within the set.

FYI, DB2 V8 allows the following. Not sure if it is in the SQL standard

SELECT C1 FROM
  (SELECT C1 FROM T1
   UNION
   SELECT C1 FROM T2
   ORDER BY C1) AS UTABLE
ORDER BY ORDER OF UTABLE

and

SELECT EMP_ACT.EMPNO, PROJNO
FROM EMP_ACT
WHERE EMP_ACT.EMPNO IN
  (SELECT EMPLOYEE.EMPNO
   FROM EMPLOYEE
   ORDER BY SALARY DESC
   FETCH FIRST 10 ROWS ONLY)
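The second DB2 query has a close analogue in SQLite, where `LIMIT` plays the role of `FETCH FIRST n ROWS ONLY` (a sketch with invented toy data, and a quota of 2 rather than 10):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empno TEXT, salary INTEGER);
    CREATE TABLE emp_act  (empno TEXT, projno TEXT);
    INSERT INTO employee VALUES ('E1', 300), ('E2', 200), ('E3', 100);
    INSERT INTO emp_act  VALUES ('E1','P1'), ('E2','P2'), ('E3','P3');
""")

# ORDER BY inside the subquery only feeds the quota; the outer
# result is still an unordered set of rows.
rows = conn.execute("""
    SELECT emp_act.empno, projno
    FROM emp_act
    WHERE emp_act.empno IN
      (SELECT empno FROM employee ORDER BY salary DESC LIMIT 2)
""").fetchall()
```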

Jan Hidders

unread,
Feb 15, 2003, 12:11:49 PM2/15/03
to
Lauri Pietarinen <lauri.pi...@atbusiness.com> wrote in message news:<3E4E2224...@atbusiness.com>...

>Jan Hidders wrote:
>>
>> SET(SET(A) INTERSECT SET(B UNION C)) =
>> SET((SET(A) INTERSECT SET(B)) UNION (SET(A) INTERSECT SET(C)))
>>
>>So, although things get a bit more complicated, there is absolutely no
>>reason why a query optimizer that is based on a bag algebra would
>>perform worse than one that is based on a set algebra.
>
>So the point here seems to be that in SQL we can always use 'DISTINCT'
>anyway and the optimizer will be able to take that into account and
>optimize accordingly?

Yes. But let me stress that I am not a fan of SQL; as a bag calculus it
is also less elegant than it could have been, to put it mildly.

>In principle this is correct, but:
>
>- more work for the optimizer groups because it has to support two "modes";
>  hence bigger and buggier products.

There is only one mode, the bag mode, and it has to take those into account
anyway even if the user only sees sets, because for optimization purposes a
bag-view is sometimes more optimal. And as I already said, the added
complexity is very little. Don't you find it remarkable that most people who
claim that bags make things much more difficult are the ones who have never
done any real research on query optimization or built a full-scale
query optimizer? I don't. In that respect Ullman, Widom and Molina can
run circles around Date, whose expertise in this area is very shallow,
to say the least.

>- optimizing efforts will be concentrated on the most used features, that
>  being queries without distinct.

If those queries are the ones that are the most used, then that is what the
users apparently want, and so indeed those should be optimized the most.

>I would draw a parallel with using GOTO's in programming languages:
>- GOTO's give programmers more freedom to do what
> they want
>- however, the programs are harder to optimize, so programs will
> end up being bigger and running slower
>- compilers will be harder to write
>- the programs will be probably be buggier
>- and as Dijkstra has pointed out, everything can be done
> _without_ GOTO's anyway.

The optimization of imperative programs is very different from
optimizing declarative query languages, so such an analogy is
virtually meaningless to people who actually know a thing or two about
optimization. Btw. What makes you think that programs with GOTOs are
hard to optimize? Dijkstra did not mention that as an argument in his
famous letter.

That bags are hard to understand is a myth. I have been teaching SQL
at university level and below that; there are several things about SQL
that are hard to explain but bags was not one of them.

Kind regards,

-- Jan Hidders

Lauri Pietarinen

unread,
Feb 15, 2003, 12:43:17 PM2/15/03
to
It might be interesting to note that Dataphor makes a clear distinction
between relations and result sets that are returned to the user
(or a program etc.):

We would have the relational expression

emp OVER {salary} WHERE deptno = 1;

and then when it is returned we have

SELECT emp OVER {salary} WHERE deptno = 1 ORDER BY salary;

Quota queries look like this:

emp OVER {salary}
WHERE deptno = 1
RETURN 10 BY {salary desc}

so also making a distinction between ordering and
quota queries.

regards,
Lauri Pietarinen


Lauri Pietarinen

unread,
Feb 15, 2003, 12:53:17 PM2/15/03
to
>
>
>Don't you find it remarkable that most
>people who
>claim that bags makes things much more difficult are the ones who have
>never
>have done any real research on query optimiziation or built a full
>scale
>query optimizer? I don't. In that respect Ullman, Widom and Molina can
>run circles around Date whose expertise in this area is very shallow,
>to say the least.
>

I can confess to never having done any research on optimization, or
built a DBMS, or even part of one. As such I don't have very much
authority on this.

But I would like to point out that Hugh Darwen was one of the main
architects of an RDBMS product in the early 80's, and he clearly agrees
with Date as far as I can tell.

I am just trying to get a hold on this matter: what is it exactly that
Date gets wrong or misunderstands? Are they trying to achieve different
goals? Maybe my head is a bit thick, but I am not convinced.

best regards,
Lauri Pietarinen


Paul Vernon

unread,
Feb 15, 2003, 2:26:36 PM2/15/03
to
"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message
news:3E4E7E8D...@atbusiness.com...
[snip]

> I am just trying to get a hold on this matter, what is it exactly that
> Date gets wrong or misunderstands.
> Are they trying to achieve different goals? Maybe my head is a bit
> thick but I am not convinced.

For what it is worth, from this discussion I am still convinced that bags are
a bad model for users, because to quote from Date's second article,

"[with bags] the range of possible results that can be obtained from any given
database is vastly expanded, though not usefully so"

"[with bags] the range of logically distinct query expressions that can be
formulated is vastly expanded, too, though again not usefully so"
In other words, Occam's razor comes into play.

The extra complexity of a bag algebra (even if it only 'a bit' more complex)
is of no use for users, at least in mine and Chris Date's opinion. Reminds me
of transactions actually, again possibilities get expanded but not usefully so
(except maybe for those that teach ;-) ;-) ).

As an internal bag algebra might well help implementations provide better
performance, I can see arguments for exposing such an algebra to users to
possibly make the implementators life easier. However, that is the tail
wagging the dog. I can only see costs in a bag algebra for users.

Lauri Pietarinen

unread,
Feb 15, 2003, 4:51:46 PM2/15/03
to
>
>
>For what it is worth, from this discussion I am still convinced that bags are
>a bad model for users, because to quote from Date's second article,
>
Thanks for the support, Paul ;-)

>The extra complexity of a bag algebra (even if it only 'a bit' more complex)
>is of no use for users, at least in mine and Chris Date's opinion. Reminds me
>of transactions actually, again possibilities get expanded but not usefully so
>(except maybe for those that teach ;-) ;-) ).
>

Same for GOTO's ;-)

>As an internal bag algebra might well help implementations provide better
>performance, I can see arguments for exposing such an algebra to users to
>possibly make the implementators life easier. However, that is the tail
>wagging the dog. I can only see costs in a bag algebra for users.
>

How does it make the implementors' life easier? They have to support
'distinct' anyway, don't they? Well, up to now 'distinct' hasn't been very
well supported, so NOT supporting it has made their life easier, I guess...

regards,
Lauri Pietarinen

Anne & Lynn Wheeler

unread,
Feb 15, 2003, 5:29:48 PM2/15/03
to
"David Cressey" <in...@dcressey.com> writes:
> My recollection was that it was to compete with IBM. DB2 used SQL,
> and it was easier to sell Rdb if the people who already knew SQL
> didn't have to learn something else. This is the way most "de
> facto" standards come into being.

there was significant resistance from STL to system/r. some of the
arguments were hierarchical vis-a-vis relational .... but probably as
vociferous were the arguments regarding physical pointers vis-a-vis indexes
(i.e. that the indexes doubled the physical space).

so the technical transfer went from sjr to endicott ... aka from
system/r to sql/ds (i was responsible for some from sjr to end). it
was later that there was technology transfer from endicott back to STL
... aka sql/ds to db2. one of the people in the following meeting
handled much of the endicott->stl transfer:
http://www.garlic.com/~lynn/95.html#13

general reference:
http://www.mcjones.org/System_R/
http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95-Contents.html

multics first commercial rdbms:
http://www.mcjones.org/System_R/mrds.html

teradata, ingres, relational tech, britton-lee, sybase, m'soft
http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95-Teradata.html

sql/ds
http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95-SQL_DS.html

shoot-out at the ok corral
http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95-Shoot-ou.html

.... note: quibble in the shoot-out tale mentioned above with regard to
compare-and-swap instruction and locking. compare-and-swap instruction
was the work of a person in cambridge science center whose initials are
CAS ... and the choice of the instruction mnemonic ... so then needed
to come up with an instruction name that matched his initials.

misc. qbe
http://www.garlic.com/~lynn/2002e.html#44 SQL wildcard origins?
http://www.garlic.com/~lynn/2002o.html#70 Pismronunciation

vs/query (qmf)
http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95-VS_QUERY.html


random past posts rf: system/r:
http://www.garlic.com/~lynn/aadsm13.htm#8 OCSP and LDAP
http://www.garlic.com/~lynn/2000.html#18 Computer of the century
http://www.garlic.com/~lynn/2000b.html#55 Multics dual-page-size scheme
http://www.garlic.com/~lynn/2000e.html#49 How did Oracle get started?
http://www.garlic.com/~lynn/2000f.html#16 [OT] FS - IBM Future System
http://www.garlic.com/~lynn/2001d.html#44 IBM was/is: Imitation...
http://www.garlic.com/~lynn/2001i.html#32 IBM OS Timeline?
http://www.garlic.com/~lynn/2002.html#10 index searching
http://www.garlic.com/~lynn/2002e.html#26 Crazy idea: has it been done?
http://www.garlic.com/~lynn/2002e.html#44 SQL wildcard origins?
http://www.garlic.com/~lynn/2002g.html#58 Amiga Rexx
http://www.garlic.com/~lynn/2002g.html#59 Amiga Rexx
http://www.garlic.com/~lynn/2002g.html#60 Amiga Rexx
http://www.garlic.com/~lynn/2002g.html#76 Pipelining in the past
http://www.garlic.com/~lynn/2002h.html#17 disk write caching (was: ibm icecube -- return of
http://www.garlic.com/~lynn/2002i.html#69 Hercules and System/390 - do we need it?
http://www.garlic.com/~lynn/2002k.html#9 Avoiding JCL Space Abends
http://www.garlic.com/~lynn/2002l.html#71 Faster seeks (was Re: Do any architectures use instruction
http://www.garlic.com/~lynn/2002n.html#36 VR vs. Portable Computing
http://www.garlic.com/~lynn/2002o.html#54 XML, AI, Cyc, psych, and literature
http://www.garlic.com/~lynn/2002q.html#32 Collating on the S/360-2540 card reader?

topic drift re compare and swap:
http://www.garlic.com/~lynn/93.html#14 S/360 addressing
http://www.garlic.com/~lynn/94.html#45 SMP, Spin Locks and Serialized Access
http://www.garlic.com/~lynn/97.html#19 Why Mainframes?
http://www.garlic.com/~lynn/98.html#8 ** Old Vintage Operating Systems **
http://www.garlic.com/~lynn/98.html#16 S/360 operating systems geneaology
http://www.garlic.com/~lynn/99.html#89 FIne-grained locking
http://www.garlic.com/~lynn/99.html#176 S/360 history
http://www.garlic.com/~lynn/99.html#203 Non-blocking synch
http://www.garlic.com/~lynn/2000e.html#25 Test and Set: Which architectures have indivisible instructions?
http://www.garlic.com/~lynn/2000g.html#16 360/370 instruction cycle time
http://www.garlic.com/~lynn/2001b.html#35 John Mashey's greatest hits
http://www.garlic.com/~lynn/2001f.html#41 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001g.html#9 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001k.html#8 Minimalist design (was Re: Parity - why even or odd)
http://www.garlic.com/~lynn/2001k.html#12 Minimalist design (was Re: Parity - why even or odd)
http://www.garlic.com/~lynn/2001k.html#66 SMP idea for the future
http://www.garlic.com/~lynn/2001k.html#69 Programming in School (was: Re: Common uses...)
http://www.garlic.com/~lynn/2002f.html#13 Hardware glitches, designed in and otherwise
http://www.garlic.com/~lynn/2002h.html#45 Future architecture [was Re: Future micro-architecture: ]
http://www.garlic.com/~lynn/2002h.html#46 Future architecture
http://www.garlic.com/~lynn/2002h.html#55 Future architecture [was Re: Future micro-architecture: ]
http://www.garlic.com/~lynn/2002h.html#87 Atomic operations redux
http://www.garlic.com/~lynn/2002i.html#82 HONE
http://www.garlic.com/~lynn/2002l.html#69 The problem with installable operating systems
http://www.garlic.com/~lynn/2002p.html#58 AMP vs SMP
http://www.garlic.com/~lynn/2003b.html#20 Card Columns


--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Internet trivia 20th anv http://www.garlic.com/~lynn/rfcietff.htm

Lauri Pietarinen

unread,
Feb 15, 2003, 7:27:16 PM2/15/03
to
Thanks for the link.

I'll try to figure it out...

Lauri


Jan Hidders

unread,
Feb 16, 2003, 5:53:31 AM2/16/03
to
"Paul Vernon" <paul....@ukk.ibmm.comm> wrote in message news:<b2m4he$eli$1...@sp15at20.hursley.ibm.com>...

> "Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message
> news:3E4E7E8D...@atbusiness.com...
> [snip]
> > I am just trying to get a hold on this matter, what is it exactly that
> > Date gets wrong or misunderstands.
> > Are they trying to achieve different goals? Maybe my head is a bit
> > thick but I am not convinced.
>
> For what it is worth, from this discussion I am still convinced that bags are
> a bad model for users, because to quote from Date's second article,
>
> "[with bags] the range of possible results that can be obtained from any given
> database is
> vastly expanded, though not usefully so"
> "[with bags] the range of logically distinct query expressions that can be
> formulated is
> vastly expanded, too, though again not usefully so"
>
> In otherwords, Occams razor comes into play.

Not to be pedantic, but Ockham's razor is for scientific theories, and
the relational model is no such thing. However, it is an engineering
method, and as such KISS applies, which is a related principle.

Anyway, the above quotes by Date are strictly speaking correct but in
this context misleading. The extra queries you can do are isomorphic
to queries that could already be done without bags. So in terms of
query optimization no extra problems are introduced that weren't there
already.
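The isomorphism Jan appeals to can be sketched concretely: a bag relation corresponds one-to-one to a set relation carrying an explicit count attribute, so any bag query can be rephrased as a set query over the counted form. A minimal Python sketch (the employee data and names are invented for illustration):

```python
from collections import Counter

# A bag (multiset) of employee tuples: (name, dept).
emp_bag = Counter([("Smith", 10), ("Jones", 10), ("Brown", 20)])

def bag_project(bag, f):
    """Bag projection: duplicates are kept as multiplicities."""
    out = Counter()
    for row, n in bag.items():
        out[f(row)] += n
    return out

# Bag query: project onto dept, keeping duplicates.
dept_bag = bag_project(emp_bag, lambda r: r[1])

# The isomorphic set query: a relation of (dept, count) pairs,
# i.e. SELECT dept, COUNT(*) FROM emp GROUP BY dept.
dept_set = {(d, n) for d, n in dept_bag.items()}

print(dept_bag)          # Counter({10: 2, 20: 1})
print(sorted(dept_set))  # [(10, 2), (20, 1)]
```

The two results carry the same information, which is why an optimizer for the set form faces no problem the bag form did not already pose.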

> As an internal bag algebra might well help implementations provide better
> performance, I can see arguments for exposing such an algebra to users to
> possibly make the implementators life easier.

I would not expose the algebra anyway; it is there for implementation
purposes. The users should be exposed to a bag *calculus*. But this is
another matter.

> However, that is the tail wagging the dog.
> I can only see costs in a bag algebra for users.

Since when is better query optimization a cost? And that bags are
harder to understand than sets is hardly what I would call a
scientific theory. My personal experience as a professional teacher
suggests otherwise, and the arguments I have seen to support it are
hardly any better than handwaving.

-- Jan Hidders, who is signing off now because his plane will be
leaving in a few hours.

Bob Badour

Feb 16, 2003, 12:57:38 PM
"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
news:51d64140.03021...@posting.google.com...
> "Bob Badour" <bba...@golden.net> wrote in message
news:<g8W2a.1405$Ay3.16...@mantis.golden.net>...
> > Some operations on relations require an explicit order: quota query, min,
> > max etc.
>
> Aren't there two separate issues here?
> 1) order on domains
> 2) order of tuples within a relation

My whole point, Paul, was the relational model disallows implicit order of
tuples within relations; however, it does allow operations that rely on an
explicit order of tuples within relations. Any explicit order necessarily
depends on the possible collating sequences of the values derivable from the
attributes in the tuples of the relation.


> The first would be generally desirable (not sure about things like
> image-valued domains though) and is needed for the MIN & MAX aggregate
> operations.
>
> The second isn't really part of relational theory but appears to be
> needed for quota queries (I assume this is things like "SELECT TOP 5 *
> FROM emp ORDER BY salary"). Although thinking about it this ORDER BY
> is really different to the standard "SELECT * FROM emp ORDER BY
> salary":

I disagree. Both ORDER BY clauses explicitly order the tuples of the input
relation as required for some operation. One operation results in an
unordered set, ie. a relation with the five highest ranking tuples, and the
other results in an external physical representation of a relation in a
specified order. However, the contribution of the ORDER BY clause is
identical in both. (I don't much like the syntax you used for the quota
query because it is not at all clear that the order by applies to the TOP
operation. How would the user tell the dbms to output the resulting five
tuples in ascending alphabetical order by last name?)


> I suppose that in theory a query language could do without an ORDER BY
> and have any ordering done in the client application.

Codd's RM/V2 excludes any such thing when it requires theta-operations like
less than, greater than, least greater, greatest lesser etc. Any data
management language unable to order data would lack utility for evaluating
extrema and for partitioning data, which are fundamentally data management
tasks.


> But for physical
> reasons it would be more efficient for the DBMS to choose an
> optimization method that would output the rows in a specified order.

It might and it might not. It depends on the query and any declared
constraints to leverage or enforce.


Bob Badour

Feb 16, 2003, 1:34:11 PM
"Paul Vernon" <paul....@ukk.ibmm.comm> wrote in message
news:b2isne$16ui$1...@sp15at20.hursley.ibm.com...

> "Bob Badour" <bba...@golden.net> wrote in message
> news:l7W2a.1404$7C3.16...@mantis.golden.net...
> > Paul,
> >
> > I can certainly understand your criticism of ORDER BY if it were applied to
> > SELECT expressions nested within other operations such as INSERT INTO,
> > CREATE VIEW etc. -- if those are even allowed by the standard or by any
> > product.
> >
> > I am not sure I understand your criticism of ORDER BY for presentation. In
> > this situation, doesn't ORDER BY apply to the operation that physically
> > encodes the result for transmission outside the DBMS? It seems to me the
> > operation in question is quite explicit even if the syntax requires no
> > keyword for it.
>
> Maybe it is a matter of taste, but I find it highly non-explicit that
>
> select distinct SAL from emp
>
> returns a Relation, but
>
> select distinct SAL from emp order by SAL
>
> returns something else.

Except that 'select distinct SAL from emp' as a complete statement does not
return a relation. It returns the external physical representation of a
relation in an arbitrary order.


> > Even if one were to type-cast the result to an implicitly ordered non-scalar
> > type such as an array, one would still require a final operation that
> > physically encodes the array for transmission outside the DBMS.
> >
> > To put it precisely, in both cases the DBMS starts with an internal physical
> > encoding of an abstract logical value and converts it to an external
> > physical encoding of the same value.
> >
> > Why is it necessary or desirable to specify two operations instead of one?
>
> Necessary for good language design and conceptual clarity I would say.
> Implicit casting is not a very good idea at the best of times and certainly
> not for something as big as a change from one non-scalar type to another.

We are not talking about implicit casting to a non-scalar type within the
type system of the dbms; we are talking about an operation that encodes data
into an external physical representation. While the language requires no
keyword for this operation, it is quite explicit. A select statement is
different from a select expression.


> > IE. one operation that changes the type using an explicit order followed by
> > a second operation that physically encodes the ordered type VS. a single
> > operation that physically encodes the result in a specified order
>
> CAST_AS_ARRAY(select SUM(SAL), DEPT from emp)
> Y_ORDERED_BY SAL
>
> would be logically more correct than the SQL syntax, but finding a nice
> syntax (which the above certainly is not) is another matter.

I disagree. What you have given above involves a redundant type conversion
that the dbms should optimize away in any case. The final result of the
operation is not an array--it is an external physical representation of a
set of tuples in an explicit order.


> > Referring to the concepts in TTM, it seems like a point of design with
> > interesting design tradeoffs. When designing type generators for relations
> > and arrays, what possible representations will the generated types have?
> > What are the pro's and con's of choosing one set of possible representations
> > over another?
>
> Poss reps are really only for scalar types.

I don't see how one can have any usable type without at least one possible
representation. However, after thinking about things, I realise the issue
here involves operations for converting tuples and relations into external
physical representations, and has little to do with internal possible
representations. The possible representations of the values in a tuple or
relation determine the possible representations of the tuple or relation.


> > It sounds like you may have already put more thought into these design
> > tradeoffs than I have, and I am interested in your opinions.
>
> Possibly. It's certainly an interesting area when one considers 'multiple
> selection' and then a mapping of databases and database subsets to other
> non-scalar variables. For example, what is the non-scalar type that is
> represented by say an OLAP reporting tool or maybe an email application? How
> much independence is there between the graphical display of information and
> non-scalar types? My suspicion is that jumping straight from relations to
> graphical displays is too much of a leap, and what is required is possibly a
> number of intermediate non-scalar types that allow for data display
> specification and more limited but more intuitive data manipulation options.

Isn't the real issue here a question of how to convert end-user gestures
into appropriate dbms queries? Aren't the types used in an OLAP tool or
email application necessarily external types from the perspective of the
DBMS?

Granted, I would like to see systems built with sufficient integration that
the internal/external distinction becomes almost moot. Programming
environments sophisticated enough to work directly with the higher-level
types of the dbms would be great, and I agree that operations to cast
relations into arrays would require an explicit order specification.


Bob Badour

Feb 16, 2003, 1:46:24 PM
"Lauri Pietarinen" <lauri.pi...@atbusiness.com> wrote in message
news:3E4E7C35...@atbusiness.com...

> It might be interesting to note that Dataphor makes a clear distinction
> between relations and result sets that are returned to the user
> (or program etc...):
>
> We would have the relational expression
>
> emp OVER {salary} WHERE deptno = 1;
>
> and then when it is returned we have
>
> SELECT emp OVER {salary} WHERE deptno = 1 ORDER BY salary;

I have to say I like this better than SQL, and I think it obviates the
distinctions leading to Paul's comments. In Dataphor, the unadorned relation
expression is always a relation, and the operation to convert the data to an
external physical representation requires an explicit SELECT keyword. It's
clearer when different things look different.


> Quota queries look like this:
>
> emp OVER {salary}
> WHERE deptno = 1
> RETURN 10 BY {salary desc}
>
> so also making a distinction between ordering and
> quota queries.

I think this is much clearer and easier to use than some of the quota query
specifications I have seen. (I guess TOP...ORDER BY is a proprietary
SqlServer extension?)

It's very clear that one can output the results in a different order:

SELECT emp OVER {salary} WHERE deptno = 1 RETURN 10 BY {salary desc} ORDER
BY surname;
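The separation drawn here is easy to model: the quota step yields an unordered set, and ORDER BY is a presentation step applied only on the way out. A small Python sketch of that two-step semantics (the rows and names are invented; this is not Dataphor code):

```python
import heapq

# Toy rows: (surname, salary).
emp = [("Adams", 52000), ("Baker", 61000), ("Clark", 48000),
       ("Davis", 61000), ("Evans", 55000)]

# Quota step -- RETURN 3 BY {salary desc}: an *unordered* set of the
# three highest-paid rows.
quota = set(heapq.nlargest(3, emp, key=lambda r: r[1]))

# Presentation step -- ORDER BY surname: applied only when the result
# leaves the system, independent of the ranking used by the quota.
presented = sorted(quota, key=lambda r: r[0])
print(presented)  # [('Baker', 61000), ('Davis', 61000), ('Evans', 55000)]
```

The ranking order used to pick the quota never leaks into the result; only the final sort determines the output order.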


Paul

Feb 17, 2003, 6:30:12 AM
"Mikito Harakiri" <mikha...@ywho.com> wrote in message news:<nGb3a.11$yd...@news.oracle.com>...

> "Paul" <pbra...@cosmos-uk.co.uk> wrote in message
> news:51d64140.03021...@posting.google.com...
> > What's the interpretation given to a table with duplicate rows?
>
> Interpretation is not always the driving force of a theory. A widely
> successful theory can have a poor interpretation. Quantum Mechanics' Copenhagen
> Interpretation, for example, has some conflicts with "common sense".

But in science the "reality" comes first and then the model is
designed to fit.
With maths you create your model first and the actual applications of
it (like relational databases) come afterwards. So it's important to
get the foundation and the motivation right. To me this seems to be one of
the biggest problems with multisets.



> > So is "multi-set" relational algebra not based on predicate logic?
>
> Correct.

So what is it based on? What is the interpretation?



> > Also, there's no way for users to identify a row because the thing
> > that differentiates them is held internally and is thus invisible to
> > the user - and this breaks the logical/physical distinction.
>
> This is a subtle issue. For example, a number 12 is 10^1*1+2*10^0. Any
> numerical value can be thought as an aggregate of its base coefficients. We
> could even store it in the database in dismembered state

But isn't there a distinction between the abstract number itself and
any given representation of it?



> Adding a count column is certainly possible, as Bob demonstrated yet one
> more time. I'm just verifying that this practice is not inferior to explicitly
> having the multiset concept in the model.

OK suppose I have an employee "relation" which is a multiset.
I have two employees called John Smith, in the same dept on the same
salary.
So my multi-relation contains two outwardly identical tuples:
("John Smith", 10, 20000)
("John Smith", 10, 20000)

Now one of the John Smiths has a sex-change and becomes Jane Smith.

How does the user update only one of the rows?
Surely it's impossible because the two rows are only distinguished
internally?

Or are you saying that base relations must be sets, derived relations
can be multi-sets? Aren't these two categories supposed to be
interchangeable though?



> I'm not claiming that multisets are superior to sets. There seem to be
> some cases where they are, and I want to see the whole picture.

Can't multisets always be defined in terms of sets?
So you could for example define a multiset [a,a,b,c,d,d,d]
as the set {(a,1),(a,2),(b,3),(c,4),(d,5),(d,6),(d,7)}
(where (x,y) denotes an ordered pair which you could define as the set
{{x},{x,y}} if you wanted)

or maybe {(a,2),(b,1),(c,1),(d,3)} would be better because it doesn't
add extra information (sequential order). Which is really the same as
adding the extra "count" column I guess.
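Both encodings, and their equivalence, can be checked mechanically. A short Python sketch, with `Counter` playing the role of the count-column form:

```python
from collections import Counter

bag = ["a", "a", "b", "c", "d", "d", "d"]

# Encoding 1: tag each occurrence with a running index -- a genuine set,
# but it smuggles in an arbitrary sequential order.
indexed = {(x, i) for i, x in enumerate(bag, start=1)}
# {('a', 1), ('a', 2), ('b', 3), ('c', 4), ('d', 5), ('d', 6), ('d', 7)}

# Encoding 2: pair each distinct element with its count -- the "count column".
counted = {(x, n) for x, n in Counter(bag).items()}
# {('a', 2), ('b', 1), ('c', 1), ('d', 3)}

# Forgetting the index tags of encoding 1 recovers encoding 2, so both
# encodings determine the same multiset.
recovered = {(x, n) for x, n in Counter(x for x, _ in indexed).items()}
assert recovered == counted
```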

So multisets really just are syntactic shorthand. Now I suppose
ultimately everything is syntactic shorthand for set theory :) but I'm
not convinced that at the logical level the advantages of multisets
outweigh the disadvantages. Quite possibly it might be different for
the physical level though.

In fact I'd say confusion over unique identifiers for (possibly
derived) relations is one of the common causes of problems I see in
practice.

> Given that a multiset is essentially a mapping of rows to nonnegative
> integers, wouldn't it be more advantageous to generalize it to a mapping of
> rows to reals?

Well this is straying off-topic I guess since databases are
necessarily finite so a suitably large finite set of integers would
suffice.
But what if you wanted a multiset of even greater cardinality than the
reals? e.g. the set of all subsets of the reals?

Paul.

Costin Cozianu

Feb 17, 2003, 12:53:53 PM
>
> No. In fact, in theory, all optimizations that can be done in a set-based
> algebra can also be done in a bag-based algebra but not the other way
> around.
>
> -- Jan Hidders
>
>

Hi Jan,
I followed this discussion, interesting as always. But I'm puzzled by
this conjecture. Has somebody studied it ?

Here's my take on it:

- I don't see in any useful sense how bag algebra can be a superset of
set algebra. The obvious inclusion mapping from sets to bags is not a
morphism: Bag(a \/ b) = Bag(a) \/ Bag(b) doesn't hold

- Bags, on the other hand, are a special kind of set. Aren't you defining
bags as functions from a universe of elements to the natural numbers?
But then bag operations are not a straight mapping of the corresponding set
operations

- therefore my suspicion is that neither algebraic structure is
isomorphic to a substructure of the other.

- it is the rule rather than the exception that the stronger the algebraic
properties a structure has, the better suited for optimization it is.
Sets have the well-known algebraic properties of lattices. Bag algebra
is a ... (bag algebra, I presume)? All my algebra books just don't
study the algebra of bags. On the contrary, lattices are widely studied,
presumably because of this situation.

- I don't find it true that all optimizations available for set algebras
are available for bag algebras. There are semantic constraints to keep
operations on bags operations on bags and not operations on sets.
SQL optimizers, for example, are constrained to keep track of the number
of rows by some very uninteresting (and irrelevant to end users, me
thinks) rules.

You resorted at some point to a proof by authority (Ullman, Molina vs.
Date), but I'm curious if you have a stated opinion of Molina et al.
vis-a-vis bag algebra being more optimizable than set algebra, or is it
just a conjecture of yours. Maybe we should consider that in their
position they have to teach students that will be hired in the DBMS
industry.

I guess what I want to say is that quite a few conjectures flew around in
this discussion, but is there some theoretical substance to them?
Can you provide some references?

best regards,
Costin Cozianu

Mikito Harakiri

Feb 17, 2003, 3:03:57 PM
"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
news:51d64140.03021...@posting.google.com...
> > > What's the interpretation given to a table with duplicate rows?
> So what is it based on? What is the interpretation?

"Partial" or "weighted" predicates. Each predicate is weighted with a
number, so that we can do some arithmetic upon predicates.

For example, the fact:

Item
----
Milk

in the multiset model tells the end user that there is *at least*
one milk container. In the set model, it tells the end user that there is
*exactly* one milk container.

> But isn't there a distinction between the abstract number itself and
> any given representation of it?

Consider a vector from linear algebra. An abstract vector. Then, according to
your logic, you would insist on introducing a special domain for it, right?
On the other hand, it's very convenient to dismember a vector into
components and consider vector operations as aggregate queries.

Representation is just a view of an abstract entity! A user should be
allowed to freely transform one representation into another. That
transformation should be explicit, i.e. defined in a declarative query
language.

> OK suppose I have an employee "relation" which is a multiset.
> I have two employees called John Smith, in the same dept on the same
> salary.
> So my multi-relation contains two outwardly identical tuples:
> ("John Smith", 10, 20000)
> ("John Smith", 10, 20000)
>
> Now one of the John Smiths has a sex-change and becomes Jane Smith.
>
> How does the user update only one of the rows?
> Surely it's impossible because the two rows are only distinguished
> internally?

The fact

("John Smith", 10, 20000)

in multiset theory is not a complete predicate. Therefore, we can't just
blindly modify it.

The <name, dept> is obviously a unique key. The question should be how do
we impose constraints in the multiset theory.

> Can't multisets always be defined in terms of sets?
> So you could for example define a multiset [a,a,b,c,d,d,d]
> as the set {(a,1),(a,2),(b,3),(c,4),(d,5),(d,6),(d,7)}
> (where (x,y) denotes an ordered pair which you could define as the set
> {{x},{x,y}} if you wanted)

The biggest problem with (x,y)={{x},{x,y}} definition is that it's not
associative:

<a,<b,c>> != <<a,b>,c>

Therefore, we can't consistently define an ordered tuple with 3 components.
This is why the above definition is just a toy that is never used in
practice. Once again, Set Reduction doesn't work!
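The inequality itself is mechanically checkable. A quick sketch of the Kuratowski encoding using Python frozensets (note that ordinary nested tuples are "non-associative" in exactly the same sense):

```python
def pair(x, y):
    # Kuratowski encoding: (x, y) = {{x}, {x, y}}
    return frozenset({frozenset({x}), frozenset({x, y})})

a, b, c = 1, 2, 3
left = pair(a, pair(b, c))   # encodes (a, (b, c))
right = pair(pair(a, b), c)  # encodes ((a, b), c)
assert left != right         # different nestings give different sets

# Plain Python tuples behave the same way:
assert (a, (b, c)) != ((a, b), c)
```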

> or maybe {(a,2),(b,1),(c,1),(d,3)} would be better because it doesn't
> add extra information (sequential order). Which is really the same as
> adding the extra "count" column I guess.

In multiset theory we are studying why

{(a,1),(a,2),(b,3),(c,4),(d,5),(d,6),(d,7)}

and

{(a,2),(b,1),(c,1),(d,3)}

are equivalent. In the set theory, however, there is nothing special about
these 2 representations.

> So multisets really just are syntactic shorthand. Now I suppose
> ultimately everything is syntactic shorthand for set theory :) but I'm
> not convinced that at the logical level the advantages of multisets
> outweigh the disadvantages. Quite possibly it might be different for
> the physical level though.

My problem is that I'm so frequently blurring the distinction between logical
and physical that the distinction of logical and physical levels is no longer a
driving principle. It seems to be the price to pay for a more powerful
computation model. Could this model be reasonably "declarative"? That is the
question.

> Well this is straying off-topic I guess since databases are
> necessarily finite so a suitably large finite set of integers would
> suffice.
> But what if you wanted a multiset of even greater cardinality than the
> reals? e.g. the set of all subsets of the reals?

No, I rather meant that nonnegative integer set is so obviously
algebraically nonclosed that it might ultimately show up. Cardinalities and
powersets in the set theory -- "set of integer sets" -- are notoriously
useless for anything more practical than measuring the length of a Turing
tape.


Steve Kass

Feb 17, 2003, 6:22:11 PM

Mikito Harakiri wrote:

>"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
>news:51d64140.03021...@posting.google.com...
>
>
>
>

>>OK suppose I have an employee "relation" which is a multiset.
>>I have two employees called John Smith, in the same dept on the same
>>salary.
>>So my multi-relation contains two outwardly identical tuples:
>>("John Smith", 10, 20000)
>>("John Smith", 10, 20000)
>>
>>Now one of the John Smiths has a sex-change and becomes Jane Smith.
>>
>>How does the user update only one of the rows?
>>Surely it's impossible because the two rows are only distinguished
>>internally?
>>

I think you are confusing your employees with your model.
If we have only the two employees named John Smith, in
department 10 with salary 20000, and we are storing this
information in a multiset, we have only _one_ fact, not two:
"There are two John Smiths in department 10 with salary
20000". Two employees, but only one fact.

If we wish to change our data so that it now represents two
facts:
"There is one John Smith in department 10 with salary 20000".
"There is one Jane Smith in department 10 with salary 20000".

we do so without any difficulty. There are many ways to
devise a physical representation of our multiset, and how we
transform it depends what the representation is. An easy one
to use, where "fact" corresponds to something physical, is to
store multiplicities, so that we initially have

"John Smith", 10, 20000, 2

and then have

"John Smith", 10, 20000, 1
"Jane Smith", 10, 20000, 1

We've gone from one fact to two facts, so we have two
"rows", neither of which matches the original row. I don't
think there's any mystery or difficulty here, so long as you
understand that there is only one fact per distinct "row"
in the multiset, and that fact includes the row data and its
multiplicity, whether or not that single fact corresponds
to two "non-facts", such as employees or data packets
in the real-world model and/or the physical model. As
soon as you violate the conceptual model there will surely
be lots of problems.
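A minimal sketch of the multiplicity-based representation described above, with Python's `Counter` standing in for the physical store (the rows are the ones from the example):

```python
from collections import Counter

# One stored fact per distinct row: (name, dept, salary) -> multiplicity.
emp = Counter({("John Smith", 10, 20000): 2})

def update_one(bag, old_row, new_row):
    """Rewrite one occurrence of old_row as new_row."""
    assert bag[old_row] > 0, "no such row"
    bag[old_row] -= 1
    bag[new_row] += 1
    if bag[old_row] == 0:
        del bag[old_row]  # drop exhausted facts

update_one(emp, ("John Smith", 10, 20000), ("Jane Smith", 10, 20000))

# One fact has become two facts, neither matching the original row.
assert emp == Counter({("John Smith", 10, 20000): 1,
                       ("Jane Smith", 10, 20000): 1})
```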

>>
>>
>
>The fact
>
>("John Smith", 10, 20000)
>
>in multiset theory is not a complete predicate. Therefore, we can't just
>blindly modify it.
>
>The <name, dept> is obviously a unique key. The question should be how do
>we impose constraints in the multiset theory.
>
>
>
>>Can't multisets always be defined in terms of sets?
>>So you could for example define a multiset [a,a,b,c,d,d,d]
>>as the set {(a,1),(a,2),(b,3),(c,4),(d,5),(d,6),(d,7)}
>>(where (x,y) denotes an ordered pair which you could define as the set
>>{{x},{x,y}} if you wanted)
>>
>>
>
>The biggest problem with (x,y)={{x},{x,y}} definition is that it's not
>associative:
>
><a,<b,c>> != <<a,b>,c>
>

What isn't associative? What does <> represent?
What does it mean for a definition to be associative?
Associativity is a property of binary operations. What
binary operation are you talking about here?

>
>Therefore, we can't consistently define ordered tuple with 3 components.
>This is why the above definition is just a toy that is never used in
>practice. Once again, Set Reduction doesn't work!
>
>

Huh? Many definitions are possible, of course. We could represent
(x,y,z) in any of these ways, I think (I'm sure I messed up the
brackets, though):
{{1,{x}},{2,{y}},{3,{z}}}
{{x},{y,{y}},{z,{z},{{z}}}}
{{x},{x,{y,{y,z}}}} -- This is the one I'd call the most natural extension.
{{x,{x,y}},{{x,{x,y}},z}} -- This is the alternative that bothers you..


Are you saying that a concept is hogwash unless there is a unique way for
it to be presented in terms of sets? That's hogwash!

>
>
>>or maybe {(a,2),(b,1),(c,1),(d,3)} would be better because it doesn't
>>add extra information (sequential order). Which is really the same as
>>adding the extra "count" column I guess.
>>
>>
>
>In multiset theory we are studying why
>
>{(a,1),(a,2),(b,3),(c,4),(d,5),(d,6),(d,7)}
>
>and
>
>{(a,2),(b,1),(c,1),(d,3)}
>
>are equivalent. In the set theory, however, there is nothing special about
>these 2 representations.
>

So?

>
>
>
>>So multisets really just are syntactic shorthand. Now I suppose
>>ultimately everything is syntactic shorthand for set theory :) but I'm
>>not convinced that at the logical level the advantages of multisets
>>outweigh the disadvantages. Quite possibly it might be different for
>>the physical level though.
>>
>>
>
>My problem is that I'm so frequently blurring the distinction between logical
>and physical that the distinction of logical and physical levels is no longer a
>driving principle. It seems to be the price to pay for a more powerful
>computation model. Could this model be reasonably "declarative"? That is the
>question.
>
>

I think you've lost track entirely of any question here,
unless you still think there is no well-founded set-based
model for multisets. If you still think that, try again to
give a reason why, other than that there is more than one
well-founded model. "There is no model, because there
are two models."??

>
>
>>Well this is straying off-topic I guess since databases are
>>necessarily finite so a suitably large finite set of integers would
>>suffice.
>>But what if you wanted a multiset of even greater cardinality than the
>>reals? e.g. the set of all subsets of the reals?
>>
>>
>
>No, I rather meant that nonnegative integer set is so obviously
>algebraically nonclosed that it might ultimately show up. Cardinalities and
>powersets in the set theory -- "set of integer sets" -- are notoriously
>useless for anything more practical than measuring the length of Turing
>tape.
>
>
>

There should be no problem modeling multisets where the
multiplicities can come from any set of cardinal numbers
with addition that contains an additive identity and is
closed under addition. You can find a handy set of cardinals
that includes 2^(2^aleph_0) if that's what you want.


SK

>
>

Mikito Harakiri

Feb 17, 2003, 6:42:26 PM

"Steve Kass" <sk...@drew.edu> wrote in message
news:b2rqnp$1tj$1...@slb2.atl.mindspring.net...

> >The biggest problem with (x,y)={{x},{x,y}} definition is that it's not
> >associative:
> >
> ><a,<b,c>> != <<a,b>,c>
> >
>
> What isn't associative? What does <> represent?
> What does it mean for a definition to be associative?
> Associativity is a property of binary operations. What
> binary operation are you talking about here?

I'm sorry: copy and paste typo. I meant

(a,(b,c)) != ((a,b),c)

> Huh? Many definitions are possible, of course. We could represent
> (x,y,z) in any of these ways, I think (I'm sure I messed up the
> brackets, though):
> {{1,{x}},{2,{y}},{3,{z}}}

You have to define 1, 2, 3, first. Assuming

1={{}}
2={{},{{}}}

And substituting into your definition, you'll get yet another definition for
an ordered pair:

(a,b) = {{{{}},{a}},{{{},{{}}},{b}}}

> {{x},{y,{y}},{z,{z},{{z}}}}
> {{x},{x,{y,{y,z}}}} -- This is the one I'd call the most natural extension.
> {{x,{x,y}},{{x,{x,y}},z}} -- This is the alternative that bothers you..

This is just a series of random definitions, well spiced with curly-bracketed
symbols. Or is there any wonderful theory about it? Where in math did you see
"We have 10 unrelated alternative definitions, pick any one you like"?

Bernard Peek

Feb 17, 2003, 6:58:37 PM
In message <b2rqnp$1tj$1...@slb2.atl.mindspring.net>, Steve Kass
<sk...@drew.edu> writes


>I think you are confusing your employees with your model.
>If we have only the two employees named John Smith, in
>department 10 with salary 20000, and we are storing this
>information in a multiset, we have only _one_ fact, not two:
>"There are two John Smiths in department 10 with salary
>20000". Two employees, but only one fact.

That depends entirely on an opinion on what constitutes a fact. It's
equally true to say that there are two facts. The first fact is that
there is an employee called John Smith, the second fact is that there
are two John Smiths. If you redefine the word (making it a term of art)
so that it always equals that which is recorded by a *unique* record in
a database then your statement becomes true.

It's usually quite important to be able to distinguish between people
that may have the same name. Let's choose a different example.

I collect books and as sometimes happens I have some duplicates. I have
two copies of the same book, quite often both of them in mint condition.
I have two database entries for that edition. I have two copies of the
book on the shelf. Suppose I give one copy away? I take one copy off of
the shelf and I delete one record from the database.

It makes no difference whether I delete the record that was created when
I acquired the first book or the second. It makes no difference whether
I remove the first book or the second.

I could create a new entity and call it "Acquisition Number" and that's
what most librarians do. Acquisition numbers are simply identity fields.
They should not be considered part of the logical data structure.

--
Bernard Peek
b...@shrdlu.com
www.diversebooks.com: SF & Computing book reviews and more.....

In search of cognoscenti

Mikito Harakiri

Feb 17, 2003, 7:26:10 PM
"Bernard Peek" <b...@shrdlu.com> wrote in message
news:JyMJPSAt...@diamond9.demon.co.uk...

> I collect books and as sometimes happens I have some duplicates. I have
> two copies of the same book, quite often both of them in mint condition.
> I have two database entries for that edition. I have two copies of the
> book on the shelf. Suppose I give one copy away? I take one copy off of
> the shelf and I delete one record from the database.
>
> It makes no difference whether I delete the record that was created when
> I acquired the first book or the second. It makes no difference whether
> I remove the first book or the second.
>
> I could create a new entity and call it "Acquisition Number" and that's
> what most librarians do. Acquisition numbers are simply identity fields.
> They should not be considered part of the logical data structure.

Again, the standard answer to this example is "add a #copies column". Your
design with multisets is equivalent to saying "I want quantity -- that is,
#copies -- to be in a base-1 system. A base-1 system is especially convenient
for additions and subtractions -- that makes insertions and deletions
trivial".
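The count-column alternative referred to here can be sketched directly in SQL. An illustrative in-memory table (not anyone's actual schema), showing that giving one copy away is a decrement rather than a choice among indistinguishable rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE book (title TEXT PRIMARY KEY, copies INTEGER NOT NULL)")
con.execute("INSERT INTO book VALUES ('Dune', 2)")  # two identical mint copies

# Give one copy away: a decrement, not a delete of an arbitrary duplicate.
con.execute("UPDATE book SET copies = copies - 1 WHERE title = 'Dune'")
con.execute("DELETE FROM book WHERE copies = 0")  # drop exhausted titles

remaining = con.execute("SELECT copies FROM book WHERE title = 'Dune'").fetchone()
print(remaining)  # (1,)
```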

Bob Badour

Feb 17, 2003, 10:19:12 PM
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:ckb4a.14$7H6...@news.oracle.com...

> "Paul" <pbra...@cosmos-uk.co.uk> wrote in message
> news:51d64140.03021...@posting.google.com...
> > > > What's the interpretation given to a table with duplicate rows?
> > So what is it based on? What is the interpretation?
>
> "Partial" or "Weighted" predicates. Each predicate is weighted with a
> number, so that we could do some arithmetic upon predicates.
>
> For example, the fact:
>
> Item
> ----
> Milk
>
> in the multiset model tells the end user that there is *at least*
> one milk container. In the set model, it tells the end user that there is
> *exactly* one milk container.

I disagree. What it says actually depends on the full predicate--internal
and external--and not on whether Item represents a set or a multiset. To my
anglophone human eyes, it looks like it says Milk is an Item.


> > But isn't there a distinction between the abstract number itself and
> > any given representation of it?
>
> Consider a vector from linear algebra. Abstract vector. Then, according to
> your logic, you would insist on introducing a special domain for it,
> right? On the other hand, it's very convenient to dismember a vector into
> components and consider vector operations as aggregate queries.
>
> Representation is just a view of an abstract entity! A user should be
> allowed to freely transform one representation into another. That
> transformation should be explicit, i.e. defined in a declarative query
> language.

Are you saying that a set-based relational dbms with full support for
domains would not allow the user to freely transform data among
representations?


> > OK suppose I have an employee "relation" which is a multiset.
> > I have two employees called John Smith, in the same dept on the same
> > salary.
> > So my multi-relation contains two outwardly identical tuples:
> > ("John Smith", 10, 20000)
> > ("John Smith", 10, 20000)
> >
> > Now one of the John Smiths has a sex-change and becomes Jane Smith.
> >
> > How does the user update only one of the rows?
> > Surely it's impossible because the two rows are only distinguished
> > internally?
>
> The fact
>
> ("John Smith", 10, 20000)
>
> in multiset theory is not a complete predicate. Therefore, we can't just
> blindly modify it.
>
> The <name, dept> is obviously a unique key. The question should be how do
> we impose constraints in the multiset theory.

Obviously, it was not a key in the example given. If the multiset really is a
multiset and if multisets are desirable, the question asked should have a
satisfying and direct answer.


> > So multisets really just are syntactic shorthand. Now I suppose
> > ultimately everything is syntactic shorthand for set theory :) but I'm
> > not convinced that at the logical level the advantages of multisets
> > outweigh the disadvantages. Quite possibly it might be different for
> > the physical level though.
>
> My problem is that I'm so frequently blurring the distinction between logical
> and physical that the distinction of logical and physical levels is no longer
> a driving principle. It seems to be the price to pay for a more powerful
> computation model. Could this model be reasonably "declarative"? That is
> the question.

The distinction is an important one. You do yourself a disservice by losing
sight of it.


Bob Badour

unread,
Feb 17, 2003, 10:32:18 PM2/17/03
to
"Bernard Peek" <b...@shrdlu.com> wrote in message
news:JyMJPSAt...@diamond9.demon.co.uk...
> In message <b2rqnp$1tj$1...@slb2.atl.mindspring.net>, Steve Kass
> <sk...@drew.edu> writes
>
>
> >I think you are confusing your employees with your model.
> >If we have only the two employees named John Smith, in
> >department 10 with salary 20000, and we are storing this
> >information in a multiset, we have only _one_ fact, not two:
> >"There are two John Smiths in department 10 with salary
> >20000". Two employees, but only one fact.
>
> That depends entirely on an opinion on what constitutes a fact. It's
> equally true to say that there are two facts. The first fact is that
> there is an employee called John Smith, the second fact is that there
> are two John Smiths. If you redefine the word (making it a term of art)
> so that it always equals that which is recorded by a *unique* record in
> a database then your statement becomes true.
>
> It's usually quite important to be able to distinguish between people
> that may have the same name. Let's choose a different example.
>
> I collect books and as sometimes happens I have some duplicates. I have
> two copies of the same book, quite often both of them in mint condition.
> I have two database entries for that edition. I have two copies of the
> book on the shelf. Suppose I give one copy away? I take one copy off of
> the shelf and I delete one record from the database.

You seem to have missed the point that there is logically no way to delete
only one of the records without also deleting the other record. One must
rely on proprietary, non-portable features that directly expose the physical
storage model. One cannot simply issue a statement to delete the book. One
must first query the dbms for the physical pointers to matching books then
arbitrarily choose one of the physical pointers to navigate for the
deletion.

This means that all delete processes must follow this two-step method even
when no duplicate books exist.


> It makes no difference whether I delete the record that was created when
> I acquired the first book or the second. It makes no difference whether
> I remove the first book or the second.

It makes a difference whether one can logically address each of the records
independently, though. How would you construct an SQL delete statement to
delete just one of the identical book rows?
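Bob's challenge can be made concrete (a hypothetical illustration; the table
name is invented). With true duplicate rows, any logical predicate matches
both copies, so one must fall back on SQLite's rowid -- a physical row
address outside the logical model -- to remove just one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE book (title TEXT)")  # no key: duplicates allowed
con.executemany("INSERT INTO book VALUES (?)", [("Dune",), ("Dune",)])

# A purely logical delete would remove BOTH rows, since the predicate
# cannot distinguish them:
#   DELETE FROM book WHERE title = 'Dune'   -- leaves zero rows

# Deleting only one copy requires navigating to a physical pointer:
con.execute("""DELETE FROM book WHERE rowid =
               (SELECT MIN(rowid) FROM book WHERE title = 'Dune')""")
print(con.execute("SELECT COUNT(*) FROM book WHERE title = 'Dune'").fetchone()[0])
# prints 1
```

This is exactly the two-step query-then-navigate pattern the post describes.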


> I could create a new entity and call it "Acquisition Number" and that's
> what most librarians do. Acquisition numbers are simply identity fields.
> They should not be considered part of the logical data structure.

Logical identifiers are absolutely required for the logical data model.
That's where logical identifiers are most important. By adding a logical
identifier, you no longer have a multiset and you might as well use a
relational dbms instead.


Bob Badour

unread,
Feb 17, 2003, 10:35:50 PM2/17/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2rqnp$1tj$1...@slb2.atl.mindspring.net...
>
>

Exactly. Multisets have little utility for logical models, because one must
resort to physical locations or addresses to make any use of the duplicates.


Bob Badour

unread,
Feb 17, 2003, 11:09:41 PM2/17/03
to
"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
news:51d64140.03021...@posting.google.com...
> "Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:<nGb3a.11$yd...@news.oracle.com>...
> > "Paul" <pbra...@cosmos-uk.co.uk> wrote in message
> > news:51d64140.03021...@posting.google.com...
> > Adding a count column is certainly possible, as Bob demonstrated yet one
> > more time. I'm just verifying whether this practice is inferior to
> > explicitly having a multiset concept in the model.
>
> OK suppose I have an employee "relation" which is a multiset.
> I have two employees called John Smith, in the same dept on the same
> salary.
> So my multi-relation contains two outwardly identical tuples:
> ("John Smith", 10, 20000)
> ("John Smith", 10, 20000)
>
> Now one of the John Smiths has a sex-change and becomes Jane Smith.
>
> How does the user update only one of the rows?

One must resort to physical pointers to identify a unique ("John Smith", 10,
20000), which makes multisets an obviously inappropriate choice upon which to
base a logical data model.

It's only a conceit or affectation of Mikito's that he's opening people's
eyes to new untried possibilities. He knows his objections do not hold
water; at best, he's just trying to challenge the people in the newsgroup to
recognize the best answers. As pedagogy, I suppose I have no objections to
that.


> So multisets really just are syntactic shorthand. Now I suppose
> ultimately everything is syntactic shorthand for set theory :) but I'm
> not convinced that at the logical level the advantages of multisets
> outweigh the disadvantages. Quite possibly it might be different for
> the physical level though.

In a relational dbms, physical duplicates represent a single fact and could
come from a variety of causes including data distribution, caching and
indexing. The only time a transient duplicate has any importance is during
aggregate evaluation.

The importance of supporting physical multisets during optimization was
simply overstated. The only questions arising for duplicates during
optimization are: Is a duplicate an error? Is duplicate removal required?
Can it be deferred? Should it be deferred?


Bernard Peek

unread,
Feb 18, 2003, 8:08:54 AM2/18/03
to
In message <0ai4a.10$m52.1...@mantis.golden.net>, Bob Badour
<bba...@golden.net> writes


>> I collect books and as sometimes happens I have some duplicates. I have
>> two copies of the same book, quite often both of them in mint condition.
>> I have two database entries for that edition. I have two copies of the
>> book on the shelf. Suppose I give one copy away? I take one copy off of
>> the shelf and I delete one record from the database.
>
>You seem to have missed the point that there is logically no way to delete
>only one of the records without also deleting the other record. One must
>rely on proprietary, non-portable features that directly expose the physical
>storage model. One cannot simply issue a statement to delete the book. One
>must first query the dbms for the physical pointers to matching books then
>arbitrarily choose one of the physical pointers to navigate for the
>deletion.

That's correct if you use a relational database. This is because
relational databases operate under the axiom that there cannot be two
identical records.

>
>This means that all delete processes must follow this two-step method even
>when no duplicate books exist.
>
>
>> It makes no difference whether I delete the record that was created when
>> I acquired the first book or the second. It makes no difference whether
>> I remove the first book or the second.
>
>It makes a difference whether one can logically address each of the records
>independently, though. How would you construct an SQL delete statement to
>delete just one of the identical book rows?

I wouldn't try to do it using SQL. SQL was designed to work on
relational databases, where there cannot be two identical records.

>
>
>> I could create a new entity and call it "Acquisition Number" and that's
>> what most librarians do. Acquisition numbers are simply identity fields.
>> They should not be considered part of the logical data structure.
>
>Logical identifiers are absolutely required for the logical data model.
>That's where logical identifiers are most important. By adding a logical
>identifier, you no longer have a multiset and you might as well use a
>relational dbms instead.

From a purely pragmatic viewpoint I invariably choose databases that at
least claim to be relational. My book collection database has an
acquisition number field.

Mikito Harakiri

unread,
Feb 18, 2003, 12:50:50 PM2/18/03
to
"Bob Badour" <bba...@golden.net> wrote in message
news:KZh4a.9$k32.1...@mantis.golden.net...

> "Mikito Harakiri" <mikha...@ywho.com> wrote in message
> news:ckb4a.14$7H6...@news.oracle.com...
> I disagree. What it says actually depends on the full predicate--internal
> and external--and not on whether Item represents a set or a multiset. To my
> anglophone human eyes, it looks like it says Milk is an Item.

My typo, again. In multiset model we have

Item
----
Milk
Milk
Soda (no pun intended:-)

whereas, in set model

Item Count
---- -----
Milk 2
Soda 1

Now, the first entry in the multiset model tells that there is at least one
Milk container, while the first entry in the set model tells us that there
are exactly 2 containers. The set model has more precision, here, granted.
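The two representations can be sketched with Python's standard library (an
illustration only, not anything proposed in the thread): a plain list plays
the multiset, and a Counter plays the set-with-count model.

```python
from collections import Counter

# Multiset view: duplicates carried as repeated elements (base-1 counting).
bag = ["Milk", "Milk", "Soda"]

# Set-with-count view: one entry per item, an explicit multiplicity.
counted = Counter(bag)

# The set model answers "exactly how many?" directly; in the bag, any
# single 'Milk' entry alone only proves "at least one" -- the exact count
# is recoverable only by scanning every entry.
assert counted["Milk"] == 2
assert bag.count("Milk") == 2
```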

> Are you saying that a set-based relational dbms with full support for
> domains would not allow the user to freely transform data among
> representations?

Please, write down a query that transforms

Num
---
2
5

into

Seq# Num
---- ---
1 2
2 2
1 5
2 5
3 5
4 5
5 5

in pure relational syntax. Now, the same, please, without explode/recursion
operator.

> The distinction [between physical and logical] is an important one.


> You do yourself a disservice by losing sight of it.

There are limitations of every model. The question always is if the goals
that one wishes to achieve warrant changing the model.

Steve Kass

unread,
Feb 18, 2003, 4:30:06 PM2/18/03
to

Bob Badour wrote:

I'm afraid I disagree. Multisets have plenty of utility for logical models
of items with multiplicities. As soon as you expect multisets to model
something else, you're not going to have much luck. It sounds like you are
expecting the multiset to model the individuality of "the duplicates", and
if you are, then you are thinking of the duplicates as separate facts, and
distinguishable. Multisets don't model the "individual duplicates" as
individual entities. If you want to do that, you need to use a set model
and include a distinguishing attribute with the entity.

SK

>
>

Steve Kass

unread,
Feb 18, 2003, 4:43:45 PM2/18/03
to

Mikito Harakiri wrote:

>This is just a series of random definitions, well spiced with curly
>bracketed symbols. Or is there any wonderful theory about it? Where in math
>did you see "We have 10 unrelated alternative definitions, pick anyone you
>like"?
>
>
>

Um, everywhere? Pick up half a dozen graph theory textbooks,
and you're likely to see nearly as many definitions. Some analysis
books define irrationals in terms of Dedekind cuts, others with
limits. Riemann integrals get defined in all kinds of ways. The
function f(z) = e^z has plenty of useful definitions. Fortunately,
mathematics is well-enough founded that the many alternate definitions
of a construct are provably equivalent. I would be surprised to
hear any mathematician suggest that the existence of many equivalent
foundational definitions of a concept indicates that the concept is
useless. There is no "one way" to define a mathematical idea using
sets.

To throw away all of mathematics because there are equivalent
formulations of ideas makes as much sense as to throw away
numbers because Europeans, Chinese, and Arabs write them
differently.

SK

>
>
>
>
>

Bob Badour

unread,
Feb 18, 2003, 4:44:13 PM2/18/03
to
"Bernard Peek" <b...@shrdlu.com> wrote in message
news:zPLtuiIm...@shrdlu.co.uk...

> In message <0ai4a.10$m52.1...@mantis.golden.net>, Bob Badour
> <bba...@golden.net> writes
>
>
> >> I collect books and as sometimes happens I have some duplicates. I have
> >> two copies of the same book, quite often both of them in mint condition.
> >> I have two database entries for that edition. I have two copies of the
> >> book on the shelf. Suppose I give one copy away? I take one copy off of
> >> the shelf and I delete one record from the database.
> >
> >You seem to have missed the point that there is logically no way to delete
> >only one of the records without also deleting the other record. One must
> >rely on proprietary, non-portable features that directly expose the physical
> >storage model. One cannot simply issue a statement to delete the book. One
> >must first query the dbms for the physical pointers to matching books then
> >arbitrarily choose one of the physical pointers to navigate for the
> >deletion.
>
> That's correct if you use a relational database. This is because
> relational databases operate under the axiom that there cannot be two
> identical records.

Huh? It's not possible with a relational dbms. The situation is only
possible with a dbms based on multisets or that otherwise expose duplicates
in the logical model. Think about it.


> >This means that all delete processes must follow this two-step method even
> >when no duplicate books exist.
> >
> >
> >> It makes no difference whether I delete the record that was created when
> >> I acquired the first book or the second. It makes no difference whether
> >> I remove the first book or the second.
> >
> >It makes a difference whether one can logically address each of the records
> >independently, though. How would you construct an SQL delete statement to
> >delete just one of the identical book rows?
>
> I wouldn't try to do it using SQL. SQL was designed to work on
> relational databases, where there cannot be two identical records.

SQL was designed to work on multisets not relations. It allows duplicate
rows. If duplicates are useful as you have suggested, I expect a reasonable
and direct answer to my question:

How would you construct an SQL delete statement to delete just one of the
identical book rows?

If not SQL, then some other DML language for a dbms that allows duplicates.


> >> I could create a new entity and call it "Acquisition Number" and that's
> >> what most librarians do. Acquisition numbers are simply identity fields.
> >> They should not be considered part of the logical data structure.
> >
> >Logical identifiers are absolutely required for the logical data model.
> >That's where logical identifiers are most important. By adding a logical
> >identifier, you no longer have a multiset and you might as well use a
> >relational dbms instead.
>
> From a purely pragmatic viewpoint I invariably choose databases that at
> least claim to be relational. My book collection database has an
> acquisition number field.

Then your database prohibits duplicates obviating your own arguments in
favour of them.

Steve Kass

unread,
Feb 18, 2003, 5:10:10 PM2/18/03
to
Bernard,

This isn't a matter of opinion. There is one determinant: "there are two
employees named John Smith". There are many consequential
truths, such as "there is at least one employee", "there is an employee
whose first name is not Nancy", "there are at least two employees
whose first and last names share a common letter of the alphabet.",
and so on.

I don't deny that it can be important to distinguish between two
John Smiths. If it is, then don't choose a logical model (multisets)
that cannot make that distinction, and then blame your poor choice
on a failure of the model. Or better yet, don't choose an insufficient
logical model (multisets), embed it in a physical model that provides
an attribute not in the logical model (the row number or disk location
of the two separate rows representing John Smith), and evaluate
the logical model on the basis of what the physical super-model
can do.

Are the integers useless for modeling geometry because they
can't distinguish between the number of sides of a triangle
and the ratio of a circle's circumference to its diameter? No. If
you represent the integer 3 as the string "3" when you want it to
represent the number of sides of a triangle and as "pi" when you
want it to represent C/d, they do a great job. Is the set of
five platonic solids the wrong model for a collection of recordings
of Beethoven's Ninth Symphony? No, not if you have a bunch
of identical solids and scratch the waveform of the performance
on the faces.

I'm not redefining any words, but we have a fundamental
difference in understanding logical vs. physical models. You are
saying that the real-world scenario of books in a library must be
represented by a logical model that does not keep track of an
actual attribute of "book" (acquisition number), and then you
blame the model for not being able to distinguish two identical
books, which can only be distinguished in the real world by the
fact that they have different values of the attribute you prohibited
the model from using. Then you choose a physical model that
distinguishes the two books by an equivalent attribute like file line
number or disk sector, claim that it is a faithful model of a logical
model that doesn't track that attribute, and take full advantage
of the extra attribute whose existence you are denying. I would just
be more comfortable if you acknowledged the additional attribute
your choice of physical model confers on the logical model.

SK

Alfredo Novoa

unread,
Feb 18, 2003, 5:08:48 PM2/18/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message news:<8BT2a.11$O%2....@news.oracle.com>...
> "Alfredo Novoa" <alf...@ncs.es> wrote in message
> news:e4330f45.03021...@posting.google.com...
> > I only want to add that duplicates also ruin updateable views.
> >
> > http://www.pgro.uk7.net/views3_0110.htm
>
> Why? View updates is about ability to solve equations with relational
> operators. While one can certainly claim that Set Algebra is superior to
> Bag Algebra, you have to demonstrate what exact technical difficulties are
> solving equations in the Bag Algebra.

The systematic approaches to the view update problem based on
predicate logic do not work with bags.


Alfredo

Steve Kass

unread,
Feb 18, 2003, 5:20:25 PM2/18/03
to

Bob Badour wrote:

>"Bernard Peek" <b...@shrdlu.com> wrote in message
>news:zPLtuiIm...@shrdlu.co.uk...
>
>
>>In message <0ai4a.10$m52.1...@mantis.golden.net>, Bob Badour
>><bba...@golden.net> writes
>>
>>
>>
>
>
>>>
>>>

>> From a purely pragmatic viewpoint I invariably choose databases that at
>>least claim to be relational. My book collection database has an
>>acquisition number field.
>>
>>
>
>Then your database prohibits duplicates obviating your own arguments in
>favour of them.
>
>

Bravo! This is exactly the point. And this is the same Bernard who
said "Acquisition numbers are simply identity fields. They should


not be considered part of the logical data structure."

Let's see: they should not be considered part of the logical data structure,
but they are absolutely essential to your database. I prefer my physical
models to have a basis in logic, so I would put the acquisition number into
the logical model. You can choose not to, but if you do, don't blame the
logical model for not being able to represent what the physical model can.

Bernard is arguing that physical models should not correspond to
the logical models they purport to implement. Ok. By the way,
I have this Oscar Mayer wiener whistle that's a physical model of the
Cleveland Symphony Orchestra, and you know what? They're a really
lousy orchestra. Wanna hear?

SK

Steve Kass

unread,
Feb 18, 2003, 5:24:42 PM2/18/03
to
Can you be more specific? What operation on what object
cannot be implemented by a physical model faithful to the
logical multiset model? Or are you making a different claim?

SK

Bob Badour

unread,
Feb 18, 2003, 5:27:49 PM2/18/03
to
"Mikito Harakiri" <mikha...@ywho.com> wrote in message
news:otu4a.13$h_3...@news.oracle.com...

> "Bob Badour" <bba...@golden.net> wrote in message
> news:KZh4a.9$k32.1...@mantis.golden.net...
> > "Mikito Harakiri" <mikha...@ywho.com> wrote in message
> > news:ckb4a.14$7H6...@news.oracle.com...
> > I disagree. What it says actually depends on the full predicate--internal
> > and external--and not on whether Item represents a set or a multiset. To my
> > anglophone human eyes, it looks like it says Milk is an Item.
>
> My typo, again. In multiset model we have
>
> Item
> ----
> Milk
> Milk
> Soda (no pun intended:-)
>
> whereas, in set model
>
> Item Count
> ---- -----
> Milk 2
> Soda 1
>
> Now, the first entry in the multiset model tells that there is at least one
> Milk container, while the first entry in the set model tells us that there
> are exactly 2 containers. The set model has more precision, here, granted.

The first entry in the set model also tells that there is at least one Milk
container, and one must rely on physical position to know that the second
row in the multiset model even exists.


> > Are you saying that a set-based relational dbms with full support for
> > domains would not allow the user to freely transform data among
> > representations?
>
> Please, write down a query that transforms
>
> Num
> ---
> 2
> 5
>
> into
>
> Seq# Num
> ---- ---
> 1 2
> 2 2
> 1 5
> 2 5
> 3 5
> 4 5
> 5 5
>
> in pure relational syntax. Now, the same, please, without explode/recursion
> operator.

First, we will have to give a name to the original relation containing the
Num column. I will use Foo for convenience. The solution follows:

( ( Integer RENAME Value AS Seq# ) JOIN Foo ) RESTRICT Seq# BETWEEN 1 AND Num

I assume that Integer is a named relation literal for the set of all
integers and that it has a single attribute, Value. In practice, Integer
would be a named relation literal for a set of integers within some large
finite range of values. Because the operands to the JOIN operation share no
attributes, the JOIN will, of course, evaluate to a cartesian product;
however, even the most primitive optimizers will evaluate the restriction
before the join.
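The evaluation can be sketched outside a DBMS (a toy illustration; as in the
expression above, the infinite Integer relation is replaced by a finite
range, and the relation and attribute names follow the example):

```python
foo = [{"Num": 2}, {"Num": 5}]
integers = [{"Seq#": v} for v in range(1, 100)]  # finite stand-in for Integer

# JOIN with no shared attributes is a cartesian product; RESTRICT then
# keeps Seq# between 1 and Num, exactly mirroring the expression above.
result = [
    {"Seq#": i["Seq#"], "Num": f["Num"]}
    for f in foo
    for i in integers
    if 1 <= i["Seq#"] <= f["Num"]
]
assert len(result) == 7  # 2 rows for Num=2 plus 5 rows for Num=5
```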

Mikito, I cannot imagine why you would seriously consider recursion for such
a simple expression. Was there some mistake or typo in your challenge?

If you do not like the named literals, we could replace Foo and Integer
above with some suitable anonymous literals:

Foo := RELATION ( TUPLE ( Num INTEGER( 2 ) ), TUPLE( Num INTEGER( 5 ) ) );
Integer := RELATION (
TUPLE ( Value INTEGER( 1 ) )
, TUPLE ( Value INTEGER( 2 ) )
, TUPLE ( Value INTEGER( 3 ) )
, TUPLE ( Value INTEGER( 4 ) )
, TUPLE ( Value INTEGER( 5 ) )
);


> > The distinction [between physical and logical] is an important one.
> > You do yourself a disservice by losing sight of it.
>
> There are limitations of every model. The question always is if the goals
> that one wishes to achieve warrant changing the model.

I have yet to see anyone describe any goal warranting the use of multisets
as the basis of a logical data model as opposed to using relations. Given
the necessary loss of logical identity and physical independence, I have
difficulty assuming such a goal exists until proved otherwise. While
mathematicians may someday supplant the relational model as the best
theory-based logical data model, I don't expect it to happen tomorrow.

We are talking about abandoning thousands of years of incremental
developments in the field of logic. Those sorts of things rarely ever happen
and certainly do not happen overnight.


Mikito Harakiri

unread,
Feb 18, 2003, 5:32:23 PM2/18/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2u9bg$50p$1...@slb2.atl.mindspring.net...

> Mikito Harakiri wrote:
>
> >This is just a series of random definitions, well spiced with curly
> >bracketed symbols. Or is there any wonderful theory about it? Where in math
> >did you see "We have 10 unrelated alternative definitions, pick anyone you
> >like"?
> Um, everywhere? Pick up half a dozen graph theory textbooks,
> and you're likely to see nearly as many definitions. Some analysis
> books define irrationals in terms of Dedekind cuts, others with
> limits.

Plus there is nonstandard analysis. Each of these is like a separate theory,
however.

Now when we write

(a,b)=def={{a},{a,b}}

what theory does it belong to? What can we deduce from this "definition"?
How does it relate to

(a,b)=def2={a,{a,b}}

for example?

> Riemann integrals get defined in all kinds of ways.

I know only one.

> The
> function f(z) = e^z has plenty of useful definitions.

Deduced from each other.

> Fortunately,
> mathematics is well-enough founded that the many alternate definitions
> of a construct are provably equivalent. I would be surprised to
> hear any mathematician suggest that the existence of many equivalent
> foundational definitions of a concept indicates that the concept is
> useless. There is no "one way" to define a mathematical idea using
> sets.
>
> To throw away all of mathematics because there are equivalent
> formulations of ideas makes as much sense as to throw away
> numbers because Europeans, Chinese, and Arabs write them
> differently.

I think we agree on all that. We disagree upon how "well founded" or "well
established" set reduction is.

Bob Badour

unread,
Feb 18, 2003, 5:44:17 PM2/18/03
to
"Steve Kass" <sk...@drew.edu> wrote in message
news:b2u8ht$9i0$1...@slb2.atl.mindspring.net...

If the facts are truly indistinguishable, they are a single fact and we have
a model based on sets. If they only represent multiplicities, one really has
a relational logical data model with an implicit count attribute.

What benefit does this implicit attribute provide?

>
> SK
>
> >
> >
>


Mikito Harakiri

unread,
Feb 18, 2003, 6:01:29 PM2/18/03
to

"Alfredo Novoa" <alf...@ncs.es> wrote in message
news:e4330f45.03021...@posting.google.com...
> > Why? View updates is about ability to solve equations with relational
> > operators. While one can certainly claim that Set Algebra is superior to
> > Bag Algebra, you have to demonstrate what exact technical difficulties
> > there are in solving equations in the Bag Algebra.
>
> The systematic approaches to the view update problem based on
> predicate logic does not work with bags.

I have a doubt about Date's approach. Is he treating each operation
individually, independently of the others? Is he treating each new tuple
inserted into a relation individually, unrelated to other tuples inserted
(in the same transaction)?

Returning to the example

select id, 'VOICE' type, voice phone
from contact
union
select id, 'FAX' type, fax phone
from contact

we can easily see that neither assumption is correct:

1. Successively applied operations can't be considered independently of each
other. In the above example the union can't be considered independently of the
antiprojection (BTW, what is the correct term for adding a pseudocolumn?).
2. Two tuples inserted together into a derived relation correspond to a
single tuple insertion in the base relation. The approach based upon "single
tuple" translation is flawed.
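This point can be sketched concretely (an illustration only; the translation
function and its behavior are invented, not Date's or any product's
mechanism). Two view tuples for the same id map back to one base insert, so
the translation must consider the inserted tuples as a group:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, voice TEXT, fax TEXT)")
con.execute("""CREATE VIEW phones AS
               SELECT id, 'VOICE' AS type, voice AS phone FROM contact
               UNION
               SELECT id, 'FAX' AS type, fax AS phone FROM contact""")

def insert_into_view(rows):
    """Translate a GROUP of view tuples back to base-table inserts."""
    by_id = {}
    for id_, type_, phone in rows:
        by_id.setdefault(id_, {})[type_] = phone
    for id_, cols in by_id.items():
        con.execute("INSERT INTO contact VALUES (?, ?, ?)",
                    (id_, cols.get("VOICE"), cols.get("FAX")))

# Two view tuples, one base row:
insert_into_view([(1, "VOICE", "555-1234"), (1, "FAX", "555-5678")])
assert con.execute("SELECT COUNT(*) FROM contact").fetchone()[0] == 1
assert con.execute("SELECT COUNT(*) FROM phones WHERE id = 1").fetchone()[0] == 2
```

A tuple-at-a-time translation would have no sensible base-table counterpart
for either view tuple on its own, which is exactly the flaw described.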


Bernard Peek

unread,
Feb 18, 2003, 6:16:11 PM2/18/03
to
In message <b2uat1$f0t$1...@slb9.atl.mindspring.net>, Steve Kass
<sk...@drew.edu> writes

>Bernard,


>
> This isn't a matter of opinion. There is one determinant: "there are two
>employees named John Smith". There are many consequential
>truths, such as "there is at least one employee", "there is an employee
>whose first name is not Nancy", "there are at least two employees
>whose first and last names share a common letter of the alphabet.",
>and so on.
>
> I don't deny that it can be important to distinguish between two
>John Smiths.

That's not my argument. My argument is that there may be no need to
distinguish between two real-world objects, each of which is referenced
by a single record in a database. The relational model (and databases
based on it) require that a distinguishing key be created even if there
is none in the logical data structure.

I don't dispute that there are real pragmatic reasons for accepting that
deviation from the logical structure of the data. But as this is a
theory newsgroup I wanted to point out that this is a (minor) failing in
the relational model.

[...]

> I'm not redefining any words, but we have a fundamental
>difference in understanding logical vs. physical models. You are
>saying that the real-world scenario of books in a library must be
>represented by a logical model that does not keep track of an
>actual attribute of "book" (acquisition number), and then you
>blame the model for not being able to distinguish two identical
>books,

No, that's not my objection. My objection is that the relational model
declares that there must be a distinction, when there is no such
requirement in the real world.

It does make the maths easier, and it makes the implementation much
easier.

Steve Kass

unread,
Feb 18, 2003, 6:30:03 PM2/18/03
to
Excellent question. Before the present discussion, I would have
said that the logical model of multisets is conceptually close to
some real-world models, and certainly a good logical model for
the "relaxed" relational model implemented by many "R"DBMS
products, where the distinguishing attribute of rows is hidden.

That said, there's also room for confusion about whether one
conceptualizes the logical model as representing bags that
can contain indistinguishable duplicates (imagine the bag is
constantly being shaken, so we can't even say "the John Smith
at the bottom of the bag"), or as representing an inventory of
distinct entities with the additional attribute of multiplicity.

These two conceptualizations of the logical model lead naturally
to two distinct physical implementations. The more natural one
(in my mind) of "bags" is the more problematic, because no
computer I've ever used has a 30 Gig "hard bag"--there will
necessarily be an additional attribute, hidden or not, that internally
distinguishes every item. The less natural one, using multiplicities,
is not as convenient for some operations, such as displaying the
contents of a multiset (with no apologies for the fact that an order
attribute is acceptable with regard to a result set, because the
display and the bag are two different things), but using multiplicities
does a better job of representing the logical model, given the
reality of actual physical computers.
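[Editorial aside: the "inventory with multiplicities" conceptualization is easy to sketch. The following Python fragment, using the standard-library `Counter` as a stand-in multiset, replays the John Smith update from earlier in the thread as a move of one unit of multiplicity; the tuples are taken from Paul's example.]

```python
from collections import Counter

# Each distinct tuple carries a multiplicity; no hidden row identity needed.
emps = Counter()
emps[("John Smith", 10, 20000)] = 2  # "two John Smiths in dept 10 at 20000"

# One John Smith becomes Jane Smith: decrement one count, increment another.
emps[("John Smith", 10, 20000)] -= 1
emps[("Jane Smith", 10, 20000)] += 1

assert emps[("John Smith", 10, 20000)] == 1
assert emps[("Jane Smith", 10, 20000)] == 1
```

Note that the update never asks "which John Smith?": it transforms one fact ("there are two") into two facts ("there is one of each"), exactly as described above.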

I really don't know if that answers your question, and I'll gladly
admit that anything I conceptualize as a multiset can simply be
represented logically and physically as a set, with the addition of
the attribute of multiplicity. But just as a consistent theory of
and notation for real numbers are useful in spite of there being
a possible representation using only sets of (sets of ...) integers,
I think an algebra of multisets, to use in logical models and
to drive physical models, is useful, and it can be completely
well-defined and consistent.
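[Editorial aside: such an algebra is indeed well-defined. Python's `Counter` happens to implement the standard multiset operations, which can serve as a concrete model of the algebra gestured at above; the element names are invented.]

```python
from collections import Counter

a = Counter({"x": 3, "y": 1})
b = Counter({"x": 1, "y": 2})

assert a + b == Counter({"x": 4, "y": 3})  # multiset sum: counts add
assert a - b == Counter({"x": 2})          # difference: counts never go negative
assert a & b == Counter({"x": 1, "y": 1})  # intersection: minimum of counts
assert a | b == Counter({"x": 3, "y": 2})  # union: maximum of counts
```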

SK

D Guntermann

unread,
Feb 18, 2003, 6:17:09 PM2/18/03
to

"Steve Kass" <sk...@drew.edu> wrote in message
news:b2u8ht$9i0$1...@slb2.atl.mindspring.net...
>
>
> Bob Badour wrote:
>
> >"Steve Kass" <sk...@drew.edu> wrote in message
> >news:b2rqnp$1tj$1...@slb2.atl.mindspring.net...
> >
> >
> >>Mikito Harakiri wrote:
> >>
> >>
> >>
> >>>"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
> >>>news:51d64140.03021...@posting.google.com...
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>OK suppose I have an employee "relation" which is a multiset.
> >>>>I have two employees called John Smith, in the same dept on the same
> >>>>salary.
> >>>>So my multi-relation contains two outwardly identical tuples:
> >>>>("John Smith", 10, 20000)
> >>>>("John Smith", 10, 20000)
> >>>>
> >>>>Now one of the John Smiths has a sex-change and becomes Jane Smith.
> >>>>
> >>>>How does the user update only one of the rows?
> >>>>Surely it's impossible because the two rows are only distinguished
> >>>>internally?
> >>>>
> >>>>
> >>>>
> >>I think you are confusing your employees with your model.
> >>If we have only the two employees named John Smith, in
> >>department 10 with salary 20000, and we are storing this
> >>information in a multiset, we have only _one_ fact, not two:
> >>"There are two John Smiths in department 10 with salary
> >>20000". Two employees, but only one fact.
> >>
> >>If we wish to change our data so that it now represents two
> >>facts:
> >>"There is one John Smith in department 10 with salary 20000".
> >>"There is one Jane Smith in department 10 with salary 20000".
> >>
> >>we do so without any difficulty. There are many ways to
> >>devise a physical representation of our multiset, and how we
> >>transform it depends what the representation is.
> >>
> >>
> >
> >Exactly. Multisets have little utility for logical models, because one must
> >resort to physical locations or addresses to make any use of the duplicates.
> >
> >
> >
> I'm afraid I disagree. Multisets have plenty of utility for logical models
> of items with multiplicities.

Steve,

Could you provide a practical example that might help me in wrapping my mind
around the notion of utility for logical models of items with
multiplicities?

I'm afraid I'm having problems understanding how one can even have some
sense of determinacy of what constitutes a multiset in contrast to a set
without some implicit logical mapping to identity. In the mind's eye, the
very basis for contrasting a multiset from a set, or vice versa, is
dependent on our very notion of identity.

For example, if I see {1,1,1,1,1}, I would have a tendency to describe it as
a collection of integer 1 values with a cardinality of five. In the process
of synthesizing my description, I find that I implicitly assign cardinality
to each member even though set theory would reduce this to {1}. Thus, I can
distinguish between the two collections.
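[Editorial aside: the {1,1,1,1,1} example above can be made concrete. In this Python sketch, the implicit assignment of cardinality described in the paragraph is exactly what `Counter` records and what `set` discards.]

```python
from collections import Counter

values = [1, 1, 1, 1, 1]

as_set = set(values)           # set theory reduces this to {1}
as_multiset = Counter(values)  # a multiset keeps the multiplicity

assert as_set == {1}
assert as_multiset == Counter({1: 5})
assert sum(as_multiset.values()) == 5  # cardinality of five
```

So the contrast between the two collections is carried entirely by the multiplicity attribute, not by any identity of the individual 1s.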

What am I missing?

I guess the root of my confusion lies in the fact that I don't see how we
relate to anything in the real world without trying to apply some notion of
identity in a logical sense.

> As soon as you expect multisets to model something else, you're not
> going to have much luck. It sounds like you are expecting the multiset
> to model the individuality of "the duplicates", and if you are, then
> you are thinking of the duplicates as separate facts, and
> distinguishable. Multisets don't model the "individual duplicates" as
> individual entities. If you want to do that, you need to use a set
> model and include a distinguishing attribute with the entity.
>

> SK
>

