
Real world issue: How to "split" queries in the presence of record replication and replication-sensitive aggregate functions?


pamela...@libero.it

Sep 14, 2006, 4:54:44 AM
I'd like to repost my original problem, which got lost amid vain fights and noise.

I am in fact trying, as an exercise, to write a "reporting" program, or more precisely a program capable of generating the queries needed to build reports.

This can be useful in OLAP systems as well as with classical DBMSs.


I'd like to get 2 pieces of information:

1. First, what is the relevant literature, and what is this problem called in the theory?
2. Most importantly, I need the actual "algorithm", that is, a systematic way to produce these queries.


I reproduce here part of my original posting and some of the answers.

---
Assume you have a few tables and a few relationships among them. For instance, if you have 2 tables A and B in a 1-N relationship, and on table A some function is defined that is, say, "replication sensitive", such as count (not count distinct) or sum, then when we join these tables and compute such functions we obtain an incorrect result, because of record replication. Many software products, such as Business Objects, are able to devise a union of subqueries to avoid that replication problem. Experiments show that such products can usually handle this when the functions are on the "fact" table (say, on the N side of the relationship), but they seem to have big problems when the functions are applied to the dimension table (say, on the 1 side of the relationship).


Would you be able to suggest an optimal split/union algorithm for this situation? I do have some "empirical" ideas on how to do that in general, but I would like to compare them with yours.

---

Let me make a first example to start the discussion. Let's take 5 simple tables. Let's avoid talking about naming, key issues or pathological design issues, and just focus on the abstract reporting problem.


CLIENT
Name
U1
U2
U3


LINE_A
ItemA
A_1
A_2
A_3
A_4


LINE_B
ItemB
B_1
B_2
B_3


TRANS_A
Name ItemA Amount
U1 A_1 10
U1 A_3 20
U1 A_1 20
U1 A_3 10
U1 A_1 20
U3 A_2 10


TRANS_B
Name ItemB Amount
U1 B_1 20
U1 B_1 10
U1 B_3 10
U1 B_1 20
U1 B_2 10
U3 B_1 20


Assume one wants this simple report (given the above):


Name | ItemA | ItemB | TotalAmountItemA | TotalAmountItemB


A first question is what is the best way to create this kind of result. And, for a general scheme that might cover similar problems, what is the best strategy for building a query that provides the expected information? Other examples might include functions on "dimension" tables.
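To see the replication problem on these tables concretely, here is a minimal sketch (in Python rather than SQL, purely for illustration; the data is copied from the tables above):

```python
from itertools import product

# Sample data from the posting: (Name, Item, Amount)
trans_a = [("U1", "A_1", 10), ("U1", "A_3", 20), ("U1", "A_1", 20),
           ("U1", "A_3", 10), ("U1", "A_1", 20), ("U3", "A_2", 10)]
trans_b = [("U1", "B_1", 20), ("U1", "B_1", 10), ("U1", "B_3", 10),
           ("U1", "B_1", 20), ("U1", "B_2", 10), ("U3", "B_1", 20)]

def group_sum(rows):
    """Per-name totals, aggregating one fact table on its own."""
    totals = {}
    for name, _item, amount in rows:
        totals[name] = totals.get(name, 0) + amount
    return totals

# Correct result: aggregate each table independently, then combine on Name.
print(group_sum(trans_a))   # {'U1': 80, 'U3': 10}
print(group_sum(trans_b))   # {'U1': 70, 'U3': 20}

# Naive single-query result: join first, then aggregate. Every TRANS_A row
# is replicated once per matching TRANS_B row, so the sums are inflated.
joined = [(a, b) for a, b in product(trans_a, trans_b) if a[0] == b[0]]
naive_a = group_sum(a for a, _b in joined)
print(naive_a)              # {'U1': 400, 'U3': 10} -- U1 wrong: 80 * 5 B-rows
```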

-----


Alexandr Savinov's reply

You can order your tables as follows:

LINE_A CLIENT LINE_B
\ / \ /
TRANS_A TRANS_B


If a table references another table then it is positioned below. It is a
6-dimensional model (the 2 numeric dimensions for the amounts are not
shown). If you formally add a bottom table then it will have the
following structure:


ItemA | Name | Name | ItemB | AmountItemA | AmountItemB


You see that two tables TRANS_A and TRANS_B are formally independent
because they do not have a common subtable. So in order to get the
desired result one can perform two steps:


1. Produce two aggregate queries for TRANS_A and TRANS_B by grouping
records and summing up the amount.


2. Combine them into one result set using the same client name as a
criterion. (However, I seriously doubt if such a result will be
meaningful.)

--------


Hi Alexandr,

That's fine and correct. But I need the general logic (that is what I meant by "algorithm") to deal with *any* arrangement of tables and relationships. Again, my perspective is a programmer's. One or more instances of a problem are good for discussion and counterexamples, but I need the general logic to deal with any set of "tables" (they can actually be views or whatever) and relationships.


The information I assume to be known is:

- the table structure of each table
- the relationships
- the functions applied to each field, and their properties (row function / aggregate function / replication sensitive or not)


Then the "algorithm" (a systematic method to perform this task) must
be able to make the partitioning and union of subqueries automatically,
according to some logic, which is the object of my question.


PS

> LINE_A CLIENT LINE_B
> \ / \ /
> TRANS_A TRANS_B


1. I am not sure I understand this notation. What do you mean by
"references"? Do you mean "is in a relationship with"?

Intuitively I see it as (I am "inventing" a representation):

LINE_A 1 --- m TRANS_A m --- 1 CLIENT 1 --- m TRANS_B m --- 1 LINE_B

where 1/m marks the kind of relationship (one-to-many).


Regarding the observation that this may not be very easy to do in
general, I note that all decent reporting programs are able to deal
with any database. Since I was thinking of creating a reporting tool
as an exercise, I was trying to understand how to do that. Programs
must be able to deal with anything.


A relatively naive reporting program would probably spit out something like:


SELECT
C.[Name] AS [Name],
sum(T1.Amount) AS [Tot_A],
sum(T.Amount) AS [Tot_B],
T.ItemB AS [ItemB],
T1.ItemA AS [ItemA]
FROM
Trans_A T1,
Client C,
Trans_B T
WHERE
C.[Name] = T1.[Name] AND
C.[Name] = T.[Name]
GROUP BY
C.[Name],
T.ItemB,
T1.ItemA;


whereas, if I understand your suggestion, it would be better to produce:


SELECT
C.[Name] AS [Name],
sum(T1.Amount) AS [Tot_A],
NULL AS [Tot_B],
NULL AS [ItemB],
T1.ItemA AS [ItemA]
FROM
Trans_A T1,
Client C
WHERE
C.[Name] = T1.[Name]
GROUP BY
C.[Name],
T1.ItemA

UNION


SELECT
C.[Name] AS [Name],
NULL AS [Tot_A],
sum(T.Amount) AS [Tot_B],
T.ItemB AS [ItemB],
NULL AS [ItemA]
FROM
Trans_B T,
Client C
WHERE
C.[Name] = T.[Name]
GROUP BY
C.[Name],
T.ItemB;


or perhaps have the possibility to execute the 2 subqueries separately.


I need a systematic method to split the set of relationships and build
the subqueries.

-P

FreeData

Sep 14, 2006, 8:28:55 AM
Two options for you:
1. Make your dimension Type 1 which will remove duplicates.
2. Remove your measure from your dimension.

pamela...@libero.it

Sep 14, 2006, 8:59:02 AM
FreeData wrote:

Hi FreeData, thanks!

> Two options for you:
> 1. Make your dimension Type 1 which will remove duplicates.

Sorry? I'm not sure I get what you mean.

> 2. Remove your measure from your dimension.

Hmmm, you must assume a programmer's perspective, not a user's.
The program must do what the user says, not vice versa.

The terms of the problem are: given a set of tables and relationships,
and given a set of fields, possibly equipped with aggregate functions,
that make up the report, determine the correct split of relationships
and the subqueries that will yield the desired result. You can see an
implementation of that, for instance, in Business Objects (the
well-known BI program) [I believe that implementation fails under some
conditions].

For instance, in my example the first query gives a wrong result: the
report computation is actually wrong. The second query (a UNION of 2
subqueries) should be the best result, given the user's request.

-P

pamela...@libero.it

Sep 14, 2006, 10:59:12 AM

[post here a copy of the reply for readers' convenience]


Hi Jan, thanks,

> Do you really want to repeat in
> it the information for each ItemB for a Name for each ItemA for that
> Name? Why so much redundancy?

It is a perfectly legal request that a(ny) user can make. The user
wants to list the items and the total amounts. But that is not the
point. The point is to find a general method to deal with any report
request. It's an abstract problem, not related to the semantic value of
the items. It must depend only on the table structures, relationships,
aggregate-function properties and the fields involved in the report.

The point is not to discuss this specific example. Examples are useful
to fix ideas or find counterexamples. This is a "theory" group, so I am
expecting abstraction and a systematic method to build these queries.

> I can of course give you the SQL statement to

You may have missed the preceding post where such a statement has
already been provided. For readers' convenience I have repackaged this
in a new thread:

http://groups.google.it/group/comp.databases.theory/browse_frm/thread/538735631f141027?hl=it

Please respond there so that others can follow.

> I suggest you first give what you think is the content of your report
> in the example you gave.

That's the point. Content is unimportant. What matter are only the
formal properties of the involved objects.
The inputs of the problem are:

- Tables (in an abstract sense, as a list of fields) and the list of
their fields, including datatypes
- Relationships
- The structure (list of fields) of the report, with the properties of
the involved functions:
- Function is aggregate: Yes/No
- Function is sensitive to replication (e.g. sum): Yes/No

That's it. Simple to state.

-P


Alexandr Savinov

Sep 15, 2006, 4:55:12 AM
pamela...@libero.it wrote:

> That's the point. Content is unimportant. What matters are only formal
> the properties of involved objects.

If we want to get meaningful reports then content is a primary issue,
while the formal model is simply a concise encoding of what we expressed
in words. This is why having a meaningful use case is so important.

> The input of the problem are:
>
> - Tables (in abstract sense, as a list of field) and a list of their
> field including datatype
> - Relationships
> - Structure (list of fields) of the report with properties of involved
> functions:
> - Function is aggregate: Yes/No
> - Function is sensitive to replication (eg. sum): Yes/no
>
> That's it. Simple to state.

I like your problem formulation. But in this form it can hardly be
solved. And the main reason is that providing "a list of fields" as a
report specification is in most cases ambiguous, i.e., it cannot be
translated into one query unambiguously.

A solution to your problem could be as follows:

1. Choose one *primary* table with entities listed in the report.

Notice 1.1 No fields are chosen yet from this and other tables.

Notice 1.2 If this table does not exist then it has to be somehow
defined, e.g., as a view or via query.

2. Define properties of the primary entities by indicating fields from
other tables.

2.1 Properties may have 3 major types: from supertables via dimension
(column) names, from subtables via inverse dimensions (columns taken in
the opposite direction), zigzag properties which combine the previous
two types.

Notice 2.2 Sometimes specifying a field is not enough and it is
necessary to provide a hint for choosing an appropriate access path.

Notice 2.3 An aggregation function can be applied to properties
returning a collection (see 2.1.2 and 2.1.3)

This approach can be more or less easily implemented, but here I would
like to warn you that the main problem is not here. It appears when you
start imposing restrictions in your report (say, via WHERE and HAVING
clauses in SQL). You may easily get cyclic dependencies which need to be
somehow resolved. But that is already another story...

>> -P
>

--
http://conceptoriented.com

pamela...@libero.it

Sep 15, 2006, 6:00:05 AM
Alexandr Savinov wrote:

> pamela...@libero.it wrote:


> This is why to have a meaningful use case is so important.

Thanks Alexandr,

I like your suggestion and I want to work on this direction.

[About the meaningfulness part, I still believe it is irrelevant. My
purpose is to implement something that gives the best output for a given
input. It's then up to the user to judge whether this makes sense or
not. But the program must work anyway; it cannot be involved in
semantics.]

>
> > The input of the problem are:
> >
> > - Tables (in abstract sense, as a list of field) and a list of their
> > field including datatype
> > - Relationships
> > - Structure (list of fields) of the report with properties of involved
> > functions:
> > - Function is aggregate: Yes/No
> > - Function is sensitive to replication (eg. sum): Yes/no
> >
> > That's it. Simple to state.
>
> I like your problem formulation. But in this form it hardly can be
> solved. And the main reason is that providing "a list of fields" as a
> report specification is in most cases ambiguous, i.e., it cannot be
> translated into one query unambiguously.
>
> A solution to your problem could be as follows:
>
> 1. Choose one *primary* table with entities listed in the report.
>
> Notice 1.1 No fields are chosen yet from this and other tables.
>
> Notice 1.2 If this table does not exist then it has to be somehow
> defined, e.g., as a view or via query.

I have begun making some attempts. I think the first step is to attain a
"partitioning" of the relationships so that we are able to identify the
tables which will go into each subquery.
The final subqueries clearly might have tables in common.

I was thinking to proceed as follows:

1. Identify all the tables which have at least one field in the report
to which a replication-sensitive aggregate function is applied and that
are, say, on the 1 side of a 1-N relationship.

2. For each of the above tables: start from that table and "navigate"
recursively through the 1-N relationships until you find, on the N
side, a table which has a field (a pure dimension) in the report. When
you find it, stop, cutting that table out.

3. At this point you should have a partition of tables and
relationships. By construction, within each partition there cannot be
issues due to record replication, because a 1-N situation with a used
field on the N side is not possible.

4. Now, to build the subqueries: for each partition, for each table in
the partition which has at least one field in the report to which a
replication-sensitive aggregate function is applied, navigate
recursively through the relationships, always in the N-1 direction,
until you cannot go farther (for instance, you find a 1-N), *allowing
exit from the partition*. Make a UNION ALL of these subqueries.
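A rough sketch of the final navigation step (a hypothetical Python sketch of my reading of it; the relationship encoding, function names, and the restriction to "one subquery per sensitive table" are invented for illustration):

```python
# One subquery per table that carries a replication-sensitive aggregate,
# joining only "upward" (N -> 1), so no row of that table is replicated.
# A relationship is encoded as (one_side, many_side), i.e. a 1-N link.

def tables_for_subquery(start, relationships):
    """From `start`, follow N -> 1 edges transitively; these tables can
    be joined without replicating any row of `start`."""
    reachable, frontier = {start}, [start]
    while frontier:
        table = frontier.pop()
        for one_side, many_side in relationships:
            if many_side == table and one_side not in reachable:
                reachable.add(one_side)
                frontier.append(one_side)
    return reachable

def split(report_tables, relationships):
    """report_tables: {table: carries_replication_sensitive_aggregate?}.
    Returns one table set per subquery to be UNIONed."""
    return [tables_for_subquery(t, relationships)
            for t, sensitive in report_tables.items() if sensitive]

# The running example: CLIENT 1-N TRANS_A, LINE_A 1-N TRANS_A, etc.
rels = [("CLIENT", "TRANS_A"), ("LINE_A", "TRANS_A"),
        ("CLIENT", "TRANS_B"), ("LINE_B", "TRANS_B")]
for part in split({"TRANS_A": True, "TRANS_B": True}, rels):
    print(sorted(part))
# ['CLIENT', 'LINE_A', 'TRANS_A']
# ['CLIENT', 'LINE_B', 'TRANS_B']
```

Each returned table set would become one grouped subquery, exactly like the two halves of the UNION query shown earlier in the thread.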

But what I cannot believe is that this problem has not been
systematically settled by someone, and that we must be reinventing this
wheel, as it seems to me the most basic issue one encounters when
extracting data from a database. Isn't this a known problem in theory,
with some name attached to it? There must be literature references
somewhere...

-P

Alexandr Savinov

Sep 15, 2006, 6:25:44 AM
pamela...@libero.it wrote:

> I have begun doing some attempts. I think the first step is to attain a
> "partioning" of the relationships so that we are able to identify the
> tables which will go in each subqueries.
> The final subqueries clearly might have tables in common.
>
> I was thinking to proceed as follows:
>
> 1. Identify all the tables which have at least one field in the report
> which has applied on it a replication sensitive aggregate function and
> that are, say, on side 1 of a 1-N relationship

What is a replication sensitive aggregation function?

> 1. For each ot the above tables
> Start from that table and "navigate" recursively through
> relationships 1-N until you find on side N a table which has a field (a
> pure dimension) in the report. When you find it stop, cutting out that
> table.
>
> 2. At this point you should have a partion of tables and relationships.
> By construction, within this partition there cannot be issues due to
> record replication: because a 1-N situation with an used field on side
> N is not possible.

How do you define a partition (of tables and relationships)?

What do you mean by "record replication"?

> 3. Now to build the subqueries,
> for each partition
> for each table in the partition which have at least one field
> in the report which has applied on it a replication sensitive aggregate
> function navigate recursively through relationship always in direction
> N-1 until you cannot go farther (for instance you find 1-N) *allowing
> to exit from partition*. Make a union all of these subqueries.

How many rows will such a report have (assuming you know the number of
rows in all the tables)? As far as I understand, the final result set is
a union, so the number of records is the sum over all subqueries. How
many records does each subquery return?

Could you demonstrate how this procedure works using the following
example? There is a table of Companies and a table of Products with a
many-to-many relationship stored in a CP table. If I select one field
from Companies and one field from Products (say, company_name and
product_name), then what report will your procedure generate?

Now let us make it more complex. Assume that I want to have a field with
the number of products produced by each company. How will your procedure
generate the report?

And finally, I want to add a field with the number of companies
producing this product (say, chips are produced by 5 companies while
beer by only 3 companies). What happens in this case?

> -P

--
http://conceptoriented.com

pamela...@libero.it

Sep 15, 2006, 8:35:06 AM
Alexandr Savinov wrote:

Thanks. Let's define a common "protocol" to discuss and interact.

> What is a replication sensitive aggregation function?

Terminology
========

Please feel free to replace the following with the *right* English
terminology.

1.
By "aggregate function" (please correct the terminology if wrong) I mean
something like:

sum()
avg()
count()
var()
or any user function that turns a bunch of values into a single scalar
...

This involves a group of records and requires a GROUP BY clause.


By "row function" I mean, for instance, a function defined as
Income - Expenditure, Age * AgeWeight, ...; this is computed on each
record. These functions are uninteresting for our purpose and do not
give us any problem.

An "aggregate function" is in some sense the opposite of a "row function".


2.
By "replication sensitive" I mean:

- an aggregate function such that, if some value (operand) is
replicated, the result comes out wrong.

For instance:

* Replication sensitive* :
sum()
count()
avg()
Any Quantile and so on
...

* Not Replication sensitive * :
max()
min()
count distinct ()
anything not influenced by the replication of values
...
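This definition can be spot-checked mechanically. A minimal Python sketch (a single duplication test is evidence of sensitivity, not a proof of insensitivity):

```python
import statistics

# An aggregate is replication sensitive iff duplicating an input value
# can change the result.
def replication_sensitive(agg, values=(1, 2, 5, 7)):
    duplicated = list(values) + [values[-1]]   # replicate one operand
    return agg(values) != agg(duplicated)

count = len
count_distinct = lambda xs: len(set(xs))

for name, agg in [("sum", sum), ("count", count),
                  ("avg", statistics.mean),
                  ("max", max), ("min", min),
                  ("count distinct", count_distinct)]:
    label = "sensitive" if replication_sensitive(agg) else "not sensitive"
    print(name, "->", label)
# sum, count, avg -> sensitive; max, min, count distinct -> not sensitive
```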


> How do you define a partition (of tables and relationships)?

You have a partition when you separate the objects in a set into
several subsets that have no common elements.

>
> What do you mean by "record replication"?

I mean, for instance, that when you join 2 tables in a 1-N relationship,
the values on the 1 side get "replicated" in the resulting table.

> Could you demonstrate how this procedure works using the following

Yes, I will. But for ease of discussion, and so that anyone here can
follow and make observations, I propose to take a free, standard,
well-known database that is easy to work with. I suggest Northwind for
ACCESS.

If you do not already have it, it is a free download:

http://www.microsoft.com/downloads/details.aspx?FamilyID=C6661372-8DBE-422B-8676-C632D66C529C&displaylang=EN

Make your request referring to those tables.

If this represents a problem in any way, I will work with the tables you
suggest; I will make a little Access db. In such a case I need a precise
description of the fields, the relationships, and the fields (with
possible functions) that have to be in the final report.

The number of records is irrelevant to our problem.

-P

David Cressey

Sep 15, 2006, 2:56:31 PM

<pamela...@libero.it> wrote in message
news:1158224084....@d34g2000cwd.googlegroups.com...

I'm going to suggest that you take a look at a topic discussed in this
newsgroup a year or two ago, concerning how to report on pizza orders,
where each pizza ordered can have a set of cheeses (like mozzarella, romano,
etc.) and an independent set of toppings (like pepperoni, mushroom, etc.)

There is a remarkable similarity between the pizza order scenario and the
example you provided us here to start the discussion.

Where you have a table named "Clients" the pizza example had a table named
"orders". Your "LINE A" and "LINE B" are like "CHEESES" and "TOPPINGS".

And your "TRANS A" and "TRANS B" correspond to two multivalued fields in
the order record, that each provide a list of cheeses and toppings for
every order.

The only difference is the two fields called "amount", which didn't exist
in the pizza example. But that's a relatively trivial difference.

Dawn was kind enough to show us all how a simple multivalued file was able
to pull up a report very much like the one you outlined, with none of the
difficulties attendant on the use of the relational model and relational
joins. In particular, by avoiding joins, the MV model would almost surely
avoid the replication that your totals are sensitive to.

I am sure you and Dawn could carry on a spirited discussion on the subject
of MV as compared with RM. You seem to have a great deal in common.

You both have a sharp eye for keeping things simple for the programmer, and
on avoiding the inflexibilities that go with structure. That seems to be
the thing that us poor benighted data architects and theoreticians always
lose sight of, to hear programmers tell it.


I look forward to a lively and entertaining dialogue.


pamela...@libero.it

Sep 15, 2006, 4:23:06 PM

David Cressey wrote:

> I'm going to suggest that you take a look in a topic discussed in this
> newsgroup a year or two ago, concerning how to report on pizza orders,
> where each pizza ordered can have a set of cheeses (like mozarella, romano,
> etc.) and an independent set of toppings (like pepperoni, mushroom, etc.)

Thanks David. I found it:
http://groups.google.it/group/comp.databases.theory/browse_frm/thread/fe25e4b0d870d71f/76b79e9027398f7d?lnk=gst&q=CHEESES+dawn&rnum=1#76b79e9027398f7d

but have not read it yet.

> There is a remarkable similarity between the pizza order scenario and the
> example you provided us here to start the discussion.

I would like to keep the matter abstract and use examples (only) to
validate or discard a possible systematic method to do what we want.
But in the end the method must work with any possible example.

> Dawn was kind enough to show us all how a simple multivalued file was able
> to pull up a report very much like the one you outlined, with none of the
> difficulties attendant on the use of the relational model and relational
> joins. In particular, by avoiding joins, the MV model would almost surely
> avoid the replication that your totals are sensitive to.

OK, let's proceed in an ordered manner. First I would like to solve the
problem within a relational framework. After we have solved this first
basic problem, we might try comparing it with other approaches.

Actually, I would like to put both into actual code. If interested, I
will send you, or any other reader, the code or the resulting application.

>
> I am sure you and Dawn could carry on a spirited discussion on the subject
> of MV as compared with RM. You seem to have a great deal in common.

Who knows !? Perhaps he/she is reading!

> You both have a sharp eye for keeping things simple for the programmer, and
> on avoiding the inflexibilities that go with structure. That seems to be
> the thing that us poor benighted data architects and theoreticians always
> lose sight of, to hear programmers tell it.

Interaction is reciprocal enrichment. I am missing all the theory part,
being focused on getting things to work.
As a programmer, my continuous effort is to make things simple (divide
et impera), in order to be able to cope with increasing levels of
complexity. Usually in programming simplicity is a (painful)
achievement. But I am only interested in solutions which result in a
general implementation, capable of dealing with any possible request
from the user and any possible configuration of
tables/relationships/output fields. Ad hoc or specific solutions are of
no interest; they are, however, useful to infer the general method and
to validate it.

> I look forward to a lively and entertaining dialogue.

Sure. Let's work on a common basis to avoid misunderstandings. I
suggested taking Northwind and making example reports with
aggregate-function replication issues on it. But if you prefer another
schema, just propose it. I will build my own little Access DB to
make the experiments and try the algorithm.

It's important that there be no ambiguity.

Thanks.

-P

David Cressey

Sep 15, 2006, 8:25:34 PM

<pamela...@libero.it> wrote in message
news:1158351786.7...@m73g2000cwd.googlegroups.com...
>
> David Cressey wrote:

> > I look forward to a lively and entertaining dialogue.


>
> Sure. Let's work on a common basis to avoid misunderstandings.

Let me clarify. I look forward to watching a lively and entertaining
dialogue between you and Dawn.

Marshall

Sep 15, 2006, 9:17:09 PM
pamela...@libero.it wrote:
>
> Please feel free to replace the following with the *right* english
> terminology.
>
> 1.
> For "aggregate" function (please correct terminology if wrong) I mean
> something like:
>
> sum()
> avg()
> count()
> var()
> or any user function that turns a bunch of values into a single scalar

Yes, these are commonly called aggregate functions.


> This involves a group of record and requires a GROUP BY clause.

It's the other way around, actually: a GROUP BY requires an
aggregate function, but an aggregate function does not require
a GROUP BY.


> [...]


> "aggregate function" is in some sense the opposite to "row function"

"Opposite" is a poor choice of words. "Dual" maybe. But even
that's not correct.


> 2.
> For "Replication sensitive" I mean:
>
> - An aggregate function such that if some value (operand) is
> replicated the result comes out wrong.
>
> For instance:
>
> * Replication sensitive* :
> sum()
> count()
> avg()
> Any Quantile and so on
> ...
>
> * Not Replication sensitive * :
> max()
> min()
> count distinct ()
> anything not influenced by the replication of values

The term is idempotent. If the binary form of the aggregate
function is idempotent, the aggregate will return the same value
even if values are repeated arbitrarily. Since + is not idempotent,
sum() is "sensitive" to repeated values. Since binary min *is*
idempotent, aggregate min() is not "sensitive" to repeated
values.
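Marshall's point can be checked directly (a minimal Python sketch using `functools.reduce` as the "binary form" of the aggregate):

```python
from functools import reduce

# If the binary form is idempotent (f(x, x) == x), folding it over a bag
# is unaffected by repeated values; a non-idempotent form like + is not.
bag          = [3, 1, 4]
with_repeats = [3, 3, 1, 4, 4, 4]

print(reduce(min, bag), reduce(min, with_repeats))  # 1 1   (min(x, x) == x)
print(reduce(lambda a, b: a + b, bag),              # 8
      reduce(lambda a, b: a + b, with_repeats))     # 19    (x + x != x)
```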


> Yes I will. But for ease of discussion and so that anyone here can
> follow and make observations. I propose to take a free standard well
> know database, easy to work with. I suggest to take Northwind for
> ACCESS.

Actually that's not a good idea. When discussing this sort of question
(or most any sort of question, really) it is best to start with
*minimal* examples. You want to have the fewest number
of tables and the fewest number of attributes needed to demonstrate
the issue. Since you're quite concerned about generality, there
is also the question of whether the minimal example will fully
generalize; but if you've chosen your minimal example well,
it often will. There's no need to haul the entirety of Northwind
into the conversation. So far I've avoided jumping in on the
conversation because I've been unable to disentangle your
complicated posts.

Anyway, to the extent that I understand your question, the
answer is either trivial or impossible as the case may be.
If you have an idempotent aggregate, it's a non-issue.
If you have a non-idempotent aggregate, and you apply a
transformation to a relation but wish to know the value
of the aggregate before the transformation, then either
your transformation is reversible, in which case you reverse
the transformation and apply the aggregate, or else
the transformation is not reversible, in which case it's
impossible.

If I haven't understood your question correctly, it would
be nice if you could construct a *minimal* example for
your question, and we could start from there.


Marshall

Bob Badour

Sep 15, 2006, 11:21:05 PM
Marshall wrote:

> pamela...@libero.it wrote:
>
>>This involves a group of record and requires a GROUP BY clause.
>
> It's the other way around, actually: a GROUP BY requires an
> aggregate function, but an aggregate function does not require
> a GROUP BY.

I doubt very many if any dbmses require an aggregate function for a
group by. Without any aggregates, group by is synonymous with project.

If that choice indicates where the ignorant learned about database
theory, that explains a lot.


> Actually that's not a good idea. When discussing this sort of question,
> (or most any sort of question, really) it is best to start with
> *minimal* examples. You want to have the fewest number
> of tables and the fewest number of attributes to demonstrate
> the issue. Since you're quite concerned about generality, there
> is also the issue of whether the minimal example will fully
> generalize. But if you've chosen your minimal example well,
> it often will. There's no need to haul the entirety of Northwind
> into the conversation. So far I've avoided jumping in on the
> conversation because I've been unable to disentagle your
> complicated posts.

Kinda like Dijkstra's polemic against diagrams as self-defeating crutches.

Cimode

Sep 16, 2006, 5:49:57 AM

pamela...@libero.it wrote:

> Ok. Let's proceed in an ordered manner. First I would like to solbe the
> problem within a relational framework.

I doubt you can solve anything, since you have no clue what would
actually constitute a *relational framework*... Else you would not have
pulled so much bullshit in all the threads you appeared in (Aren't
you hungry? How about a *relational sandwich*, since this word is used
with so much assurance by you and with so little knowledge of what it
implies).

And nope, neither *Northwind* nor *Access* will help you do anything
about it... You must do some reading if you want to use terms such as
*relational* without perpetually redefining their meaning to satisfy
your ego... Here, read and shut up:

An Introduction to Database Systems by C. J. Date
http://www.amazon.com/Introduction-Database-Systems-Eighth/dp/0321197844/sr=8-2/qid=1158399965/ref=sr_1_2/102-9704023-0646539?ie=UTF8&s=books

Practical Issues in Database Management by F. Pascal
http://www.amazon.com/exec/obidos/ASIN/0201485559/databasede095-20?creative=327641&camp=14573&adid=0EVVSZ5KA9TKB3D465P3&link_code=as1

pamela...@libero.it

Sep 16, 2006, 6:17:06 AM

Marshall wrote:

> The term is idempotent. If the binary form of the aggregate
> function is idempotent, the aggregate will return the same value
> even if values are repeated arbitrarily. Since + is not idempotent,
> sum() is "sensitive" to repeated values. Since binary min *is*
> idempotent, aggregate min() is not "sensitive" to repeated
> values.

I have been reading the definition of idempotence
http://en.wikipedia.org/wiki/Idempotent

it says:
A unary operation (i.e., a function) is idempotent if, whenever it is
applied twice to any element, it gives the same result as if it were
applied once: f maps X to X, and f is idempotent if f(f(x)) = f(x) for
all x.

In my case I mean something different (the function is not unary; an
aggregate function works on several values). Example:

count distinct({1, 2, 5, 7, 7, 7, 7, 7, ...}) does not change if I
replicate 7 any number of times; for sum or count or avg... this
replication does change the result.

So far I cannot see that the above is the same as the idempotence
definition I read on the wiki. But please, if you have a definition that
shows that, let me know. I do want to use the right term for this
concept.

Your statement, "If the binary form of the aggregate function is
idempotent, the aggregate will return the same value even if values are
repeated arbitrarily", is not wrong, but it is not equivalent to my
definition of a "replication sensitive" (RS) function.

Your statement is the same as to say:

Idempotence ==> not replication sensitive

But it does not imply the other way round. I am saying that we need
the double implication for equivalence: <==>. So your statement is true,
but it does not seem to be the definition of what I am calling a
"replication sensitive" function.

It's important that we clarify terminology, to be sure we are talking
about the same thing. Please feel free to correct me if I am wrong.

> If I haven't understood your question correctly, it would
> be nice if you could construct a *minimal* example for
> your question, and we could start from there.


Ok, I made a proposal before and that could be used. If you want to make
one or more, please do. I would be happy to try to apply my empirical
idea to it, so that you can say whether or not it makes sense to you.

> Anyway, to the extent that I understand your question, the
> answer is either trivial or impossible as the case may be.


What I am saying is that when such RS functions appear as the outcome
of a report, there is sometimes (depending on the relationships) the
need to split the query into a union of subqueries, because the presence
of replicated values may alter the result.

All leading reporting tools are capable of generating such
subqueries. I am looking for a systematic method to generate those
subqueries. I am not inventing the problem: there is no doubt that the
problem is there. I just want to understand how to provide the answer.


The nature of the data is unimportant, the semantics are unimportant,
the "right" design is unimportant. The only thing that matters for this
problem is to be able to provide a *formally* correct report, whatever
the design is. Formally correct means that the computation of the
functions will be right, and not wrong due to value replication,
whatever the design may be, wrong or right. It's actually * better if
you propose highly pathological designs * because there we can test our
logic better.


In the example I made above, you see that the first query (the one that
would for instance be generated by Access) gives a wrong result. The
second one, which would be generated by Business Objects, seems to give
a right answer.

-P

pamela...@libero.it

Sep 16, 2006, 6:24:41 AM
Bob Badour wrote:

> Marshall wrote:
>
> > pamela...@libero.it wrote:
> >
> >>This involves a group of record and requires a GROUP BY clause.
> >
> > It's the other way around, actually: a GROUP BY requires an
> > aggregate function, but an aggregate function does not require
> > a GROUP BY.
>
> I doubt very many if any dbmses require an aggregate function for a
> group by. Without any aggregates, group by is synonymous with project.
>

Let's drop this detail, which is unimportant for the main problem.

> If that choice indicates where the ignorant learned about database
> theory, that explains a lot.
>

It doesn't. I never learned it.

-P

pamela...@libero.it

Sep 16, 2006, 6:27:56 AM
> Introduction to Databases by CJ DATE....

Thanks for the commercials, Chris. I was missing them.

David Cressey

Sep 16, 2006, 10:18:16 AM

<pamela...@libero.it> wrote in message
news:1158224084....@d34g2000cwd.googlegroups.com...

> I'd like to re-propose my original problem, which got lost among vain
> fights and pollution.
>
> I am in fact trying to write, as an exercise, a "reporting" program. Or,
> more precisely, a program which is capable of generating queries useful
> for making reports.
>
> This can be useful in OLAP systems as well as for classical DBMSs.
>
>
> I'd like to get 2 pieces of information:
>
> 1. First what is the relevant literature and how this problem is
> called in the theory
> 2. Most importantly, I need the actual "algorithm", that is a
> systematic way to produce
> these queries

[great big snip]

Given that any body of data whatsoever can be expressed in a suitable
schema of tables, and given that any body of data whatsoever can be
expressed in a suitable report, it follows that, in order to create a
completely general query creator, you need to produce an algorithm that
will produce an algorithm for any given required transformation from the
supplied inputs into the required outputs. In order to come up with such
an algorithm in full generality, you are going to have to solve the
general problem of automatic programming.

Such a project is considerably more ambitious than "writing a reporting
program as an exercise".

I suggest you define the scope of what you are trying to achieve more
precisely.


Marshall

Sep 16, 2006, 11:15:25 AM
pamela...@libero.it wrote:
> Marshall wrote:
> > The term is idempotent. If the binary form of the aggregate
> > function is idempotent, the aggregate will return the same value
> > even if values are repeated arbitrarily. Since + is not idempotent,
> > sum() is "sensitive" to repeated values. Since binary min *is*
> > idempotent, aggregate min() is not "sensitive" to repeated
> > values.
>
> I have been reading the definition of idempotence
> http://en.wikipedia.org/wiki/Idempotent
>
> it says:
> A unary operation (i.e., a function), is idempotent if, whenever it is
> applied twice to any element, it gives the same result as if it were
> applied once: f maps X to X
> f is an idempotent function: f(f(x)) = f(x) for all x,

Yes, I was concerned that someone might confuse binary
idempotence with unary idempotence, which is why I specifically
used the word "binary" twice in the above paragraph. I'm afraid
you'll need to reread the wikipedia entry with this in mind.


> In my case I mean something different (the function is not unary,
> aggregate function
> works on several values).

You're also going to need to understand how binary functions
are used to define aggregate functions. Once you understand
how binary + relates to sum(), and how min(x,y) relates to
aggregate min(), only then will you be able to understand
what I'm talking about, and why the correct term for what
you're describing is "idempotence."
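For instance, a sketch along these lines (Python, with `functools.reduce` standing in for the fold that builds an aggregate from its binary form):

```python
from functools import reduce

values = [3, 1, 4]
with_dups = values + [4, 4, 4]  # the value 4 repeated three extra times

# Aggregate min() is the fold of binary min. Binary min is idempotent
# (min(x, x) == x), so repeated values cannot change the result.
assert reduce(min, values) == reduce(min, with_dups) == 1

# sum() is the fold of binary +. Binary + is not idempotent
# (x + x != x unless x == 0), so repeated values do change the result.
assert reduce(lambda a, b: a + b, values) == 8
assert reduce(lambda a, b: a + b, with_dups) == 20
```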


> In the example I made above, you see that the first query (that would
> for instance be generated by Access) gives a wrong result. The second
> one, which would be generated by Business Objects, seems a right
> answer.

I'm sorry, but I'm not willing to work through your overly
complicated example to try to figure out how to answer
your question. If you can produce a minimal example,
I would look at it. Maybe something with one table
and two columns, and maybe two or three rows. Start
small, then work up. Don't start big and complicated;
that just leads to miscommunication and wasted effort.


Marshall

pamela...@libero.it

Sep 16, 2006, 11:36:39 AM

Marshall wrote:

> I'm afraid
> you'll need to reread the wikipedia entry with this in mind.

Thanks, I will.


>
> I'm sorry, but I'm not willing to work through your overly
> complicated example to try to figure out how to answer
> your question. If you can produce a minimal example,
> I would look at it. Maybe something with one table
> and two column, and maybe two or three rows.

Ok, I will produce a couple of simple examples using Northwind, which
is quite common, so that we do not have to build a schema from scratch.

I will keep it simple (just one table is useless, because in such a
case no value duplication can occur).

For table relationships see: Northwind

Pathological examples (a plain join would give wrong reports)

[the reports may make *no sense*: I am just focusing on causing
value duplication within the join. The functions can be substituted
with any other function that is influenced by value "duplication" (ok,
say not IDEMPOTENT). The goal is to deal consistently with
any report request, even if it may not be meaningful in practice.]


Report 1
=========
Products.ProductID
[Order Details].Quantity
Orders.OrderID
avg([Order Details].UnitPrice)
count(Orders.Freight)
where
Function1 = avg([Order Details].UnitPrice)
Function2 = count(Orders.Freight)


Report 2
=========
Products.ProductName | Function1 | [Order Details].UnitPrice

where
Function1 = sum(Products.UnitsInStock)

queries which give possibly "wrong" results
(the function computation may be wrong):

SELECT
Products.ProductID AS "ProductID",
[Order Details].Quantity AS "Quantity",
Orders.OrderID AS "OrderID",
avg([Order Details].UnitPrice) AS "Function1",
count(Orders.Freight) AS "Function2"
FROM
[Order Details],
Products,
Orders
WHERE
Products.ProductID = [Order Details].ProductID AND
Orders.OrderID = [Order Details].OrderID
GROUP BY
Products.ProductID,
[Order Details].Quantity,
Orders.OrderID


SELECT
Products.ProductName AS "ProductName",
sum(Products.UnitsInStock) AS "Function1",
[Order Details].UnitPrice AS "UnitPrice"
FROM
[Order Details]
INNER JOIN
Products
ON Products.ProductID = [Order Details].ProductID
GROUP BY
Products.ProductName,
[Order Details].UnitPrice

>
> Marshall

pamela...@libero.it

Sep 16, 2006, 11:45:22 AM

David Cressey wrote:

> Such a project is considerably more ambitious than "writing a reporting
> program as an exercise".

Not scared about that :)

> I suggest you define the scope of what you are trying to achieve more
> precisely.

Here it is: ==> I want to focus only on the logic necessary to "split"
a join into a Union of select, when istead not splitting it would cause
a wrong computation (due to the presence of duplicated values) of some
functions which appear in the report.

The other cases are trivial. Making a query generator for the other
cases is immediate (to me).

-P

Marshall

Sep 16, 2006, 12:37:44 PM
pamela...@libero.it wrote:
> Marshall wrote:
> . I'm afraid
> > you'll need to reread the wikipedia entry with this in mind.
>
> Thanks, I will.

You are entirely welcome.


> Ok I will produce a couple of simple examples using Northwind, which
> is quite common so that we do not have to build a schema from scratch.
>
> I will keep it simple (just one table is useless, because in such a case
> no value duplication can occur).

I'm sorry, but you're still not there yet. If you can't do it with one
table, then try two. If you can't do it with two columns, try
three. The example really needs to be as small as possible.
When I say "as small as possible" I literally mean that the
removal of even one table, or one column, or even just one
row would not demonstrate the issue any more.


> For table relationships see: Northwind

No thanks. If you use Northwind, you're making me have
to go find it, then find a dbms that can load it, then load it,
then look at it, before I can even parse your question.
I'm squeezing question-answering on this forum in between
raising kids, having a full time job, etc. and that's just too
much work. This is why asking the minimal question is so
important. Don't make me go do extra work just so I can
answer your question.

The following is quite blunt, but it could help you a lot:

How To Ask Questions The Smart Way

http://www.catb.org/~esr/faqs/smart-questions.html


Marshall

pamela...@libero.it

Sep 16, 2006, 4:00:53 PM
Marshall wrote:

> When I say "as small as possible" I literally mean that the
> removal of even one table, or one column, or even just one
> row would not demonstrate the issue any more.

Here is one small example I can think of.
The minimum number of tables is probably 3, because:
1 table: it's impossible to duplicate values
2 tables in a 1-N relationship: placing a function on the dimension table
would probably make too artificial an example

3 tables
--------------

ONE
|
TRANS
|
SUBTRANS

Table: ONE
Name
David
Marshall

Table: TRANS
Name Item Fact
David IT2 1
Marshall IT1 1
Marshall IT1 1

Table: SUBTRANS
Item SubFact
IT1 1
IT2 1
IT2 1
IT2 1

Report:
One.[Name] AS [Name],
count(Trans.Fact) AS [FactCount],
count(SubTrans.SubFact) AS [SubFactCount]


David has 1 fact and Marshall 2 facts.

"questionable" select:

SELECT
One.[Name] AS [Name],
count(Trans.Fact) AS [FactCount],
count(SubTrans.SubFact) AS [SubFactCount]
FROM
One,
Trans,
SubTrans
WHERE
One.[Name] = Trans.[Name] AND
Trans.Item = SubTrans.Item
GROUP BY
One.[Name];
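(The data and the "questionable" select above run as posted; below is a sketch using SQLite from Python, which is only my harness of choice, together with the results it produces:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE One(Name TEXT);
CREATE TABLE Trans(Name TEXT, Item TEXT, Fact INTEGER);
CREATE TABLE SubTrans(Item TEXT, SubFact INTEGER);
INSERT INTO One VALUES ('David'), ('Marshall');
INSERT INTO Trans VALUES ('David', 'IT2', 1),
                         ('Marshall', 'IT1', 1), ('Marshall', 'IT1', 1);
INSERT INTO SubTrans VALUES ('IT1', 1), ('IT2', 1), ('IT2', 1), ('IT2', 1);
""")

# The "questionable" three-way join: David's single Trans row meets
# three SubTrans rows, so his FactCount is inflated from 1 to 3.
naive = con.execute("""
    SELECT One.Name, count(Trans.Fact), count(SubTrans.SubFact)
    FROM One, Trans, SubTrans
    WHERE One.Name = Trans.Name AND Trans.Item = SubTrans.Item
    GROUP BY One.Name ORDER BY One.Name
""").fetchall()
# naive == [('David', 3, 3), ('Marshall', 2, 2)]

# Aggregating over Trans before SubTrans enters the join gives the
# expected counts: David 1 fact, Marshall 2 facts.
facts = con.execute("""
    SELECT One.Name, count(Trans.Fact)
    FROM One JOIN Trans ON One.Name = Trans.Name
    GROUP BY One.Name ORDER BY One.Name
""").fetchall()
# facts == [('David', 1), ('Marshall', 2)]
```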

>
>
> The following is quite blunt, but it could help you a lot:
>
> How To Ask Questions The Smart Way
>
> http://www.catb.org/~esr/faqs/smart-questions.html

Quite massive. I will be spending the night with it :) It may save some
time in the future...
Thanks

-P

> Marshall

Marshall

Sep 16, 2006, 4:21:03 PM

I note that both TRANS and SUBTRANS have duplicate
rows. Did you know that that is a really bad practice?


> Report:
> One.[Name] AS [Name],
> count(Trans.Fact) AS [FactCount],
> count(SubTrans.SubFact) AS [SubFactCount]
>
>
> David has 1 fact and Marshall 2 facts.
>
> "questionable" select:
>
> SELECT
> One.[Name] AS [Name],
> count(Trans.Fact) AS [FactCount],
> count(SubTrans.SubFact) AS [SubFactCount]
> FROM
> One,
> Trans,
> SubTrans
> WHERE
> One.[Name] = Trans.[Name] AND
> Trans.Item = SubTrans.Item
> GROUP BY
> One.[Name];

So your question is how to handle the repetition of
values from TRANS when aggregating over a column
in TRANS with a non-idempotent aggregate?

What is wrong with "don't do that" as the answer?
Or do the aggregate before you join the third table.


Marshall

pamela...@libero.it

Sep 16, 2006, 5:50:41 PM
Marshall wrote:
>
> I note that both TRANS and SUBTRANS have duplicate
> rows. Did you know that that is a really bad practice?

ok let's change to:

Item SubFact
IT1 1
IT2 2
IT2 3
IT2 4

Name Item Fact
David IT2 1

Marshall IT1 2
Marshall IT1 3

[records are unimportant. In any case, in practice there can be
transactions where one buys the same item and spends the same money.
Here the records "seem" equal because the key fields for Trans are
missing. But that's irrelevant to our purpose. I wanted to avoid
discussion about the contents, because they are really immaterial here]

> So your question is how to handle the repetition of
> values from TRANS when aggregating over a column
> in TRANS with a non-idempotent aggregate?

The duplication I am talking about is the one caused by joining tables
in a 1-N relationship. When you make the join ONE-TRANS-SUBTRANS, the
values in TRANS are replicated due to the 1-N relationship with
SUBTRANS. So (non-idempotent) functions computed on TRANS consider the
same values more times than they should (duplicated values). This yields
a wrong result in the report for these functions.

Smart software is perfectly capable of splitting this query into
subqueries, as Alexander also pointed out in the previous post. My
original question was:

1. how is this problem called in the theory (are there references to
study?)
2. what is the algorithm to do that (or is it some kind of industry
secret that only a few leading software packages implement?)

> What is wrong with "don't do that" as the answer?

That's not allowed :)) (It is allowed to have an algorithm that warns
the user about some possible problem in the design)

I hope I finally succeeded in explaining what I am talking about :)

-P

Marshall

Sep 16, 2006, 6:40:37 PM
pamela...@libero.it wrote:
>
> [records are unimportant. In any case, in practice there can be
> transactions where one buys the same item and spend the same money.

In practice, those two invoices will have different invoice ids.


> Here it "seems" record are equal because field key for trans are
> missing. But that's irrelevant to our purpose. I wanted to avoid
> discussion about contents because really immaterial here]

It's okay to leave some stuff out, but it's not okay to leave
essential things out. The fact that sets do not have duplicates
is essential, from both a practical and theoretical perspective.


> > So your question is how to handle the repetition of
> > values from TRANS when aggregating over a column
> > in TRANS with a non-idempotent aggregate?
>
> The duplication I am talking about is the one caused by joining tables
> in 1-N relationship.

A minimal example to demonstrate this, with two tables and four columns
and three rows:

S(x,y)
-----
1, a
2, a

T(y, z)
-----
a, b

S join T(x, y, z)
----------------
1, a, b
2, a, b

b appears twice in S join T but it only appears once in T.

(This can also be illustrated with two tables, one column each,
one row each.)


> When you make the join ONE-TRANS-SUBTRANS the
> values in TRANS are replicated due to the relationship 1-N with
> SUBTRANS . So (non idempotent) functions computed on TRANS consider the
> same values more times they should (duplicated values).

Yes. The result of count(z) from T will not be equal to count(z) from
(S join T).
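Checked mechanically, for what it's worth (SQLite from Python, purely as a harness):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE S(x INTEGER, y TEXT);
CREATE TABLE T(y TEXT, z TEXT);
INSERT INTO S VALUES (1, 'a'), (2, 'a');
INSERT INTO T VALUES ('a', 'b');
""")

# b appears twice in S join T, although T holds it only once
joined = con.execute(
    "SELECT S.x, S.y, T.z FROM S JOIN T ON S.y = T.y ORDER BY S.x"
).fetchall()
# joined == [(1, 'a', 'b'), (2, 'a', 'b')]

count_in_T = con.execute("SELECT count(z) FROM T").fetchone()[0]     # 1
count_in_join = con.execute(
    "SELECT count(T.z) FROM S JOIN T ON S.y = T.y").fetchone()[0]    # 2
```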


> This yeld a wrong result in the report for these functions.

Whether this is wrong or not depends on what you want. If
you want an aggregate for the unjoined values of T.z, then
do the aggregate. Don't take the aggregate over some other
set of numbers.


> Smart software is perfectly capable to split this query in subqueries,
> as also Alexander pointed out in the previous post.

If you want to split S join T into subqueries, the first subquery
is "select * from S" or simply "S" and the second subquery is "T".


> My original question was:
>
> 1. how is this problem called in the theory (are there references to
> study?)

This is not a problem. This is what the operator does. It's
what it is supposed to do.


> 2. what is the algorithm to do that (or is it some kind of industry
> secret that only few leader softwares implements?)

Which "that" are you asking about? Join?


> > What is wrong with "don't do that" as the answer?
>
> That's not allowed :)) (It's allowed to have an algorithm that warn
> the user about some possible problem in the design)

The language can't read the user's mind. If the user asks
a perfectly valid and well-formed question of the system,
the system has no way of knowing that the user "really
meant" something else.


Marshall

pamela...@libero.it

Sep 17, 2006, 4:26:42 AM

Right Marshall, I replied to you instead of the group. I was using
another computer and got mixed up.

To sum up, the problem makes sense when you have real-world joins with
several tables and fields involved. There you can have duplication of
records, because of 1-N relationships, caused by the choice (for the
report) of fields that belong to "distant" tables. So, in other words,
this kind of "problem" might occur and a user (certainly not a very
expert one) may not really be aware of it. Smart tools which generate
queries for reports, where aggregate functions affected by value
duplication are involved, usually do not allow this kind of
duplication, or in any case give the user the possibility to choose.
What they do, when appropriate and possible, is an automatic splitting
of the original query into a union of subqueries (there can be
several), in order to make sure that the computation of these functions
is not wrong.

My original question was about a systematic method (I was calling it an
"algorithm") to create this union of subqueries (I was thinking that
there is an underlying partitioning of tables and relationships, as in
the recursive logic I tried to sketch above). A systematic method means
to me that it would be capable of formally analyzing, and possibly
splitting (if necessary), any query (records are not concerned: only
fields, relationships and functions).


here is the reply I accidentally emailed:

On 9/16/06, pamela...@libero.it <pamela...@libero.it> wrote:
>
> Marshall wrote:


>
> > A minimal example to demonstrate this, with two tables and four
> > columns and three rows:
>

I am glad we are finally tuned.

>
> > Whether this is wrong or not depends on what you want. If you want
> > an aggregate for the unjoined values of T.z, then
>

That is obvious.


> > > Smart software is perfectly capable to split this query in
> > > subqueries, as also Alexander pointed out in the previous post.
> >
> > If you want to split S join T into subqueries, the first subquery is
> > "select * from S" or simply "S" and the second subquery is "T".
>

Ok, but the programmer's perspective is that of dealing with dozens or
hundreds of tables and hundreds or thousands of fields. For instance,
SAP has over half a million fields in it. For simple cases there is no
need for reporting software. In the real world, value duplication in
big queries can occur without the user seeing it. In such cases you
may get wrong reports, and you might get fired. I agree that this can
also be due to poor design, but it is always helpful to have a
powerful automatic tool that helps you out.

Modern reporting software is capable of detecting that and presenting
the user with all the possible choices. It is able to split the whole
set of tables and relationships into subsets and build all the various
possible unions of subqueries.

Then the user can see what is going on and, if needed, even redesign his
report or the table schema.


> > > My original question was:
> > >
> > > 1. how is this problem called in the theory (are there
> > > references to
> > > study?)
> >
> > This is not a problem. This is what the operator does. It's what it
> > is supposed to do.
>

The problem is to make the user aware, by presenting him all the
possible ways to split & partition the query.


> > > 2. what is the algorithm to do that (or is it some kind of
> > > industry secret that only few leader softwares implements?)
> >
> > Which "that" are you asking about? Join?
>


The systematic method to build the subqueries which avoid wrong
results due to data replication, as for instance Business Objects
does [http://www.businessobjects.com/]; and as you can see from their
revenue, they are not exactly the last in the field.

> >
> > > > What is wrong with "don't do that" as the answer?
> > >
> > > That's not allowed :)) (It's allowed to have an algorithm that
> > > warn the user about some possible problem in the design)
> >
> > The language can't read the user's mind. If the user asks a
> > perfectly valid and well-formed question of the system, the system
> > has no way of knowing that the user "really meant" something else.
>

No, but it may make the user aware and present alternatives.


> >
> > Marshall
>

Thanks, time for me to go sleep. See you guys tomorrow.

/P

Alexandr Savinov

Sep 17, 2006, 6:43:13 AM
pamela...@libero.it wrote:

You actually have at least two problems:

1. One is that you need to write a query that produces a correct result
set. Repeated or replicated records are not a problem in themselves,
and neither are aggregate functions that are or are not sensitive to
them. So the systematic method is just to write a correct query which
returns precisely what you expect. In your examples, one approach
consists in using *nested* queries. An attempt to use a "concise" form
like this one

One.[Name] = Trans.[Name] AND Trans.Item = SubTrans.Item

is naive and will not produce what you need. Unfortunately, SQL is not
very suitable for this task. I find it easier and more "systematic" if
you formulate one concrete task:

- how to get all records from a direct or indirect subtable (SUBTRANS in
your example) referencing a selected subset of records from the main
table (ONE in your example)

It is one of the main procedures needed for analytical reports and if
you have it (if you can implement it) then everything else is more or
less easy. You simply write:

REPORT (primary table: ONE)
Field 1: Name
Field 2: count( deproject("TRANS->Name","Fact") )
Field 3: count( deproject("TRANS->Item->Name","Fact") )

Here function deproject takes two arguments:
- a string starting with the name of some subtable followed by a
sequence of column names leading to the main table.
- column name from the subtable


2. The second problem is about translating a user-friendly specification
of the report into its formal specification. In particular, you wanted
to translate a list of arbitrary fields taken from any table into a
formal specification of the expected (meaningful) result. (How to
express this formal specification in SQL or any other language is
already another problem.) This problem can already be characterized as
theoretical and as requiring a systematic approach. In contrast to the
first problem above, it is more difficult and cannot be solved by simply
encoding something in SQL.

> -P
>

--
http://conceptoriented.com

pamela...@libero.it

Sep 17, 2006, 6:44:33 AM

>>Marshall wrote:


Let's use this notation:

F1, F2, ... any table such that at least one field is involved in what
you call a "non-idempotent" function, i.e., a function that is affected
by possible value duplication (count, sum, avg, ...)

A1, A2, ... any other table

(Intuitively: F for forbid duplication, A for allow duplication.)
(I am still not sure whether I should also use a symbol for tables with
no fields involved in the report, but now I think that is probably
useless.)


Let's also use the notation used by Alexander for a 1-N relationship:

X
 \     means X -< Y (a one-to-many relationship)
  Y


See a couple of outputs of a possible automated method.


EXAMPLE 1. Pizza
----------------

Consider for instance


A1     A2     A3
  \   /  \   /
   F1     F2


Output:
in this case the query will be:


A1     A2            A2     A3
  \   /    union       \   /
   F1                   F2
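To sketch what the split buys, the pizza pattern can be squeezed down to the shared dimension A2 plus the two fact tables (the toy data and the SQLite-from-Python harness are my own assumptions; the split is shown as grouped subqueries combined on the shared dimension, one possible SQL shape of the union idea):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A2(k2 TEXT);
CREATE TABLE F1(k2 TEXT, v1 INTEGER);
CREATE TABLE F2(k2 TEXT, v2 INTEGER);
INSERT INTO A2 VALUES ('x'), ('y');
INSERT INTO F1 VALUES ('x', 10), ('x', 20), ('y', 5);
INSERT INTO F2 VALUES ('x', 1), ('x', 2), ('x', 3), ('y', 7);
""")

# Single join across both fact tables: each F1 row for 'x' meets
# three F2 rows, so sum(v1) for 'x' comes out as 90 instead of 30.
naive = con.execute("""
    SELECT A2.k2, sum(F1.v1), sum(F2.v2)
    FROM A2, F1, F2
    WHERE A2.k2 = F1.k2 AND A2.k2 = F2.k2
    GROUP BY A2.k2 ORDER BY A2.k2
""").fetchall()
# naive == [('x', 90, 12), ('y', 5, 7)]

# Split form: aggregate each fact table in its own subquery first,
# then combine the pieces on the shared dimension.
split = con.execute("""
    SELECT A2.k2, s1.t1, s2.t2
    FROM A2
    JOIN (SELECT k2, sum(v1) AS t1 FROM F1 GROUP BY k2) s1 ON s1.k2 = A2.k2
    JOIN (SELECT k2, sum(v2) AS t2 FROM F2 GROUP BY k2) s2 ON s2.k2 = A2.k2
    ORDER BY A2.k2
""").fetchall()
# split == [('x', 30, 6), ('y', 5, 7)]
```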


EXAMPLE 2. subtrans (pathological)
-------------------

A1
 \
  F1
   \
    F2

Output:
in this case there is no union that can make it
"safe"; there are 2 distinct "partitions" P1, P2:

P1:  A1          P2:  F2
      \
       F1

with a "cutting" relationship:

F1
 \
  F2

possible queries, in addition to

A1
 \
  F1
   \
    F2

which might yield wrong results, would only be:

A1
 \
  F1

and

F2


The presence of the "partition" should raise a possible
alert to the user [this is the same as your "what about not doing it"].

The user should analyze the situation and decide what to do.


By "systematic" method I mean a technique that gives this kind of output
*no matter how complicated the pattern of tables and relationships is*.

I have some intuitive idea on how to do that..., but I would like to
discuss it ...


-P

Alexandr Savinov

Sep 17, 2006, 6:52:19 AM
pamela...@libero.it wrote:

> The duplication I am talking about is the one caused by joining tables
> in 1-N relationship. When you make the join ONE-TRANS-SUBTRANS the
> values in TRANS are replicated due to the relationship 1-N with
> SUBTRANS . So (non idempotent) functions computed on TRANS consider the
> same values more times they should (duplicated values). This yeld a
> wrong result in the report for these functions.

Some comments. The result is not wrong: your query returns precisely
what it has to return. And records are repeated NOT because of the
existence of a 1-N relationship; they occur because your query produces
them. In this sense your problem consists in finding a way to express
things in SQL.

> Smart software is perfectly capable to split this query in subqueries,
> as also Alexander pointed out in the previous post. My original
> question was:
>
> 1. how is this problem called in the theory (are there references to
> study?)
> 2. what is the algorithm to do that (or is it some kind of industry
> secret that only few leader softwares implements?)

I think that it is not a theoretical problem but a problem of
implementation, i.e., how to implement certain logic. It seems that you
know what kind of result you want to get, and the only difficulty is in
writing the corresponding SQL query. I do not say that it is simple. I
see several alternatives for how it can be encoded in SQL, including
nested queries and conventional WHERE restrictions. Yet the main
difficulty is in the performance of such queries. For a real report such
a query may take many pages, and the ability to optimize it can be very
important.

> -P
>

http://conceptoriented.com

pamela...@libero.it

Sep 17, 2006, 7:58:25 AM
Alexandr Savinov wrote:

> pamela...@libero.it wrote:

> Some comments. The result is not wrong - your query returns precisely
> what it has to return. And records are repeated NOT because of the
> existence of 1-N relationship - they occur because your query produces
> them. In this sense your problem consists in finding a way how to
> express things in SQL.

Yes, I agree: this is exactly what I meant. The duplication "on side 1"
is naturally created during the join process, due to the 1-N
relationship. But in a "chain" of joins, if you have some sum, count,
... functions on a side 1, you may get a result that *may or may not*
be "correct" (depending on your "interpretation" of the function). My
aim was to devise a general way to produce these queries, so as to make
the user aware of these possible problems and to provide him with the
various alternatives, including the various unions of subqueries when
necessary.

This is what reporting software usually does.

>
> I think that it is not a theoretical problem but the problem of

The part I am talking about is the theoretical one: a general method to
create the queries and subqueries. Clearly, this is usually destined to
be implemented, although I guess it could be performed manually on
small designs.

> implementation, i.e., how to implement certain logic. It seems that you
> know what kind of result you want to get and the only difficulty is in
> writing the corresponding SQL query. I do not say that it is simple. I
> see several alternatives how it can be encoded in SQL including nested
> queries and conventional WHERE restrictions. Yet the main difficulty is
> in the performance of such queries. For a real report such a query may
> take many pages and the ability to optimize it can be very important.

Yes, that's actually one of the many problems I have in mind, and that
I found out about when coding these things, like also determining the
optimal permutation of tables in a join under some constraints, etc...

But I do not dare to ask...

>
> > -P
> >
>
> http://conceptoriented.com

Alexandr Savinov

Sep 17, 2006, 9:43:29 AM

pamela...@libero.it wrote:

Here is a possible problem formulation and solution.

Given a sequence of N tables T1, T2, ..., TN, where each table
references the previous one through its column ci, i = 2, ..., N. Find
all records from the last table TN which indirectly reference records
from the first table T1.

Query for 2 tables:

Deproject2 =
select T2.id
from T2, T1
where T2.c2 = T1.id -- or, alternatively, via an explicit JOIN

Solution for N tables is defined recursively:

DeprojectN =
select TN.id
from TN, (DeprojectN-1) as TN-1
where TN.cN = TN-1.id

Notes:

1. A column for aggregation should be selected in the last table instead
of the id column.

2. Additional conditions for selecting records from the first table can
be added to the query.

3. This query is used only to demonstrate the main logic (actually, it
demonstrates how bad SQL is for solving such problems).

4. In order to use GROUP BY you need to add the corresponding column
and maybe some other options.
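Concretely, the recursive construction above can be sketched with a small script; the tables, columns, and data below are hypothetical, chosen only to exercise DeprojectN (here N=3) against an in-memory SQLite database:

```python
# Sketch of the recursive "deproject" query over a chain T1 <- T2 <- T3,
# where each Ti references T(i-1) via a column c. Tables and data are
# illustrative, not from the thread.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE T1 (id INTEGER PRIMARY KEY);
CREATE TABLE T2 (id INTEGER PRIMARY KEY, c INTEGER REFERENCES T1(id));
CREATE TABLE T3 (id INTEGER PRIMARY KEY, c INTEGER REFERENCES T2(id));
INSERT INTO T1 VALUES (1), (2);
INSERT INTO T2 VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO T3 VALUES (100, 10), (101, 10), (102, 12);
""")

def deproject(n):
    """Build DeprojectN: ids in Tn that indirectly reference rows of T1."""
    query = "SELECT T1.id AS id FROM T1"
    for i in range(2, n + 1):
        query = (f"SELECT T{i}.id AS id FROM T{i} JOIN ({query}) AS prev "
                 f"ON T{i}.c = prev.id")
    return query

rows = [r[0] for r in cur.execute(deproject(3))]
print(sorted(rows))   # -> [100, 101, 102]
```

Per note 1, swapping `T{n}.id` in the outermost SELECT for an aggregation column gives the version used for sums and counts.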

--
http://conceptoriented.com

pamela...@libero.it

Sep 17, 2006, 11:21:40 AM

Alexandr Savinov ha scritto:

> demonstrate how bad is SQL for solving such problems).


>
> 4. In order to use groupby you need to add the corresponding column and
> may be some other options.

Thanks!! Alexandr

finally here we go! This is just the direction I meant. Just working
with formal properties of the design!

It's true that somehow we are coping with SQL limits, but it is also
true that it is expected (from programs) that they give the best answer
within these limits.

Your recursive procedure (which assumes a sort of "cascade" pattern
and proceeds backwards in direction N-1) is fine by me. Actually, so
fine that I have in mind (at an intuitive stage) a generalized version
of it, to produce the subqueries, which (I hope) could work with any
"pattern" of tables/relationships and any choice of functions.

The general idea is to do a "double scan": one forward in direction 1-N
to cut the design into parts which "cannot safely stay together" (could
give wrong results on functions due to duplication), and another
backward in direction N-1 to build the union of subqueries...

Would you like me to try to explain it, so that you theorists can help
me to fix possible flaws and suggest corrections ? If it works I will
write the corresponding program...

-P
>
> --
> http://conceptoriented.com

Bob Badour

Sep 17, 2006, 1:20:55 PM
Marshall wrote:

In other words, no matter how proud Pamela is of her ignorance, the
primary role of a dbms is not to compensate for ignorance. The primary
role of a dbms is to manage data as instructed.

It's obvious to any reasonably intelligent person that if one doesn't
want the effects of a join, one doesn't do a join. Likewise, if one
doesn't want the effects of a restrict, one doesn't do a restrict, etc.
In fact, I would expect it to seem rather obvious even to people of
rather modest intelligence.

Does anyone remember the "I hate it when I do that" sketches from SNL?

pamela...@libero.it

Sep 17, 2006, 2:03:12 PM

Bob Badour ha scritto:

> In other words, no matter how proud Pamela is of her ignorance, the
> primary role of a dbms is not to compensate for ignorance.

Actually dear Bob,

I was thinking that the term you have coined for me, the "invincible
ignorant", is actually quite *brilliant*, and it is a pretty close
portrait of my personality.

I know you did not intend it as a compliment, and just wanted to be rude,
to keep up with your stereotype of being the local curmudgeon, bashing
all the new entries, but anyway I do like it.

In all my life, my constant inclination was to be *creative* instead
of to be "learnitive". I always preferred to create my wheel instead of
studying the others'. And only after finishing it would I compare it with
the other ones. And sometimes, it even turns out that my wheel runs
better than the existing ones. When it doesn't, I like to study the
differences, and this process is quite helpful to me.

I really like that expression, "invincible ignorant". Actually, I do
think, seriously, that some (clearly "relative") ignorance
<space here to say that I am an absolute ignorant :)) >
is fundamental to be able to give original contributions, as usually,
once the mind has been impressed by some patterns, it is rather
difficult to escape them.

"Ignorance" is, in my opinion, a fundamental ingredient for real
originality.
Clearly it may not be sufficient. Further, sometimes one can make very
trivial mistakes. But I am not afraid of being wrong (especially with
software, where I can always redesign it).

Courage to be wrong and humility to recognize when you are, are
essential to grow.

I think I will be using that expression to describe my attitude toward
life and science. Thanks.


PS
Ah, and if you feel you want to say something technical or intelligent
instead of constantly vomiting shit, please * do not be ashamed *: I
will not think less of you than what I am now thinking :)

-P

Marshall

Sep 17, 2006, 3:28:15 PM
pamela...@libero.it wrote:
>
> In all my life, my constant inclination was to be *creative* instead
> of to be "learnitive".

The two are not antagonistic; in fact they are actually synergistic.
Education and creativity reinforce each other.


> I always preferred to create my wheel instead of
> studying the others'.

Widespread adoption of this technique would lead to every
generation trying to reinvent everything that came before.
It is directly counter to human progress.


> And sometimes, it even turns out that my wheel runs
> better than the existing ones.

How would you know, if you hadn't studied the others?
How would you even know how to compare different
wheels if you only knew your own?

This common idea is actually the fable the ignorant
tell themselves to justify their lack of education.


> Actually, I do think, seriously, that some (clearly "relative")

> ignorance [...]
> is fundamental to be able to give original contributions [...]

The greatest scientists reject this idea. The greatest artists
likewise.

Do you think any of the great computer scientists practiced
ignorance? Just about every advance in CS I can think of
was made by someone with a PhD. About the only exception
to that rule is Robin Milner, who only had an undergraduate
degree but did spend much of his adult life in academia.

Andrew Wyeth's father made sure his son was extensively tutored
in painting, drawing, and illustration. A reporter once asked him
if he wasn't afraid that all those classes might kill talent. His
father replied, "If that had killed it, it deserved to be killed."


> "Ignorance" is, in my opinion, a fundamental ingredient for real
> originality.

Really, the only thing ignorance is fundamental to is in knowing
less than the educated.


> Courage to be wrong and humility to recognize when you are, are
> essential to grow.

I am glad that you recognize this, and I look forward to your
realization that your attraction to ignorance was a severe mistake.

Although actually I don't see any way you could realize that
it's a mistake until you had tried it both ways. Why don't
you?


Marshall

Bob Badour

Sep 17, 2006, 4:17:25 PM
Marshall wrote:

Marshall, your reply to Pamela was persuasively worded yet has two fatal
flaws:

1. The willfully ignorant like Pamela have the necessary arrogance to
hide from their errors behind the shield of their ignorance. You allowed
her to confuse cowardice with courage and hubris with humility without
any direct challenge to the absurdity.

2. No argument will ever persuade anyone whose psychosis suffices in the
first place to confuse cowardice with courage or hubris with humility.

pamela...@libero.it

Sep 17, 2006, 4:38:44 PM

Marshall ha scritto:

Looks like you have a lot of degrees of freedom in interpretation.
I talked about "relative" ignorance. We may have quite different
standards.

Anyway, the recreational break is over. Time for lesson. Let's get back
to the tech stuff.

I want also to settle that terminology matter.

Could you please provide me with the formal definition (I want a
rigorous symbolic expression) of "binary idempotent function"?

I want to convince myself that was the right term. But I cannot find
the "binary" version.

-P

>
>
> Marshall

Marshall

Sep 17, 2006, 5:46:13 PM
Bob Badour wrote:

> Marshall wrote:
>
> Marshall, your reply to Pamela was persuasively worded yet has two fatal
> flaws:

Bob,

Thanks for the feedback. As to your point 1, I freely admit the
failing. It is an area I am currently working on, and still need
work in. I'll try to digest what you said. As to your point 2,
sometimes one is more concerned with the wider audience,
even when one is addressing a specific individual. I am
particularly keen lately to knock down any hint that there
is any advantage to ignorance.


Marshall

Marshall

Sep 17, 2006, 5:54:11 PM
pamela...@libero.it wrote:
>
> Could you please provide me with the formal definition (I want a
> rigorous symbolic expression) of "binary idempotent function"?
>
> I want to convince myself that was the right term. But I cannot find
> the "binary" version.

It's in the wikipedia entry you mentioned earlier, under
"Formal definitions / Binary operation"

http://en.wikipedia.org/wiki/Idempotent

"If S is a set with a binary operation * on it, then an element
s of S is said to be idempotent (with respect to *) if

s * s = s.

In particular, any identity element is idempotent. If every element
of S is idempotent, then the binary operation * is said to be
idempotent. For example, the operations of set union and set
intersection are both idempotent."

Consider * as binary min:

forall s, min(s, s) = s

So min meets the definition. (Likewise max.)

However, it is not the case for sum or count.

not forall s, s + s = s

So + is not idempotent. Since + is not idempotent, sum()
isn't either.

Etc.
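The claim is easy to check mechanically: build each aggregate by folding its binary form over a list of values, and compare the result with and without repeated values. A minimal sketch (the helper names are mine):

```python
# Build an aggregate by folding a binary operation over a multiset, then
# test whether repeating input values changes the result.
from functools import reduce

def aggregate(op, values):
    return reduce(op, values)

data = [5, 6, 2]
repeated = [5, 5, 6, 6, 2, 2]   # same values, arbitrarily duplicated

# Binary min is idempotent: min(s, s) == s. The aggregate ignores repeats.
assert aggregate(min, data) == aggregate(min, repeated) == 2

# + is not idempotent: s + s != s in general. So sum() is "sensitive".
plus = lambda a, b: a + b
assert aggregate(plus, data) == 13
assert aggregate(plus, repeated) == 26
```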


Marshall

pamela...@libero.it

Sep 17, 2006, 5:59:41 PM

> Thanks for the feedback. As to your point 1, I freely admit the
> failing. It is an area I am currently working on, and still need

I think instead that people who hide behind insults do that because they
are scared of an open confrontation with scientific arguments.

I am always ready to confront and admit my errors because I have a
sincere love for what is right and I have no problem if this is stated
by another person.


Ok I found this reference:

http://www.latrobe.edu.au/philosophy/phimvt/joy/j04alg.html

"Idempotency, zero elements and arities"

A binary function f(x,y) is called idempotent if for all x

f(x,x) = x

--------------------------------------------------

According to the above definition:

countDistinct (5,6) = 2
countDistinct is "not idempotent" (1 counterexample is enough)


max(x,x) = x for any x
max is "idempotent"


(1) Your statement is: "Duplication Sensitive" <=> Non Idempotent


if we negate both sides, we obtain an equivalent statement:


(2) "Not Duplication Sensitive" <=> Idempotent


Now observe that CountDistinct is a "Not Duplication Sensitive" function;
example: CountDistinct (2,2) = 1.


At the same time CountDistinct is "not idempotent" (see above).


Therefore we have found 1 case where:


(3) "Not Duplication Sensitive function" => "idempotent"


is false. Therefore (2), (1) are false.

----------------------------------------------------------


(The converse implication of (3) is clearly true)


Do you see errors in the above argument?

-P

Bob Badour

Sep 17, 2006, 7:36:38 PM
Marshall wrote:

To your credit, I cannot chastise you this time for ignoring the most
compelling argument. The other points she made were equally absurd, and
your replies persuasively showed them as complete nonsense.

What's more, you responded to her incoherence with something coherent
and made it look easy, which earns gobs and gobs of style points. As a
post to persuade third parties, I have to give you two thumbs up.

(I am only allowed to do that because I have two thumbs.)

Marshall

Sep 17, 2006, 9:30:43 PM
pamela...@libero.it wrote:
>
> Ok I found this reference:
>
> http://www.latrobe.edu.au/philosophy/phimvt/joy/j04alg.html
>
> "Idempotency, zero elements and arities"
>
> A binary function f(x,y) is called idempotent if for all x
>
> f(x,x) = x

Yes, this is exactly what I've been saying, and what's been
in the wikipedia entry, from the beginning. It's almost
word for word what was in my last post.


> According to the above definition:
>
> countDistinct (5,6) = 2
> countDistinct is "not idempotent" (1 counterexample is enough)

You still don't understand how binary functions are used
to create aggregate functions. Remember back in the
beginning I said you needed to understand that? There's
no way to make the aggregate count distinct out of your
above binary function. So this is not a counterexample.


> max(x,x) = x for any x
> max is "idempotent"
>
> (1) Your statement is: "Duplication Sensitive" <=> Non Idempotent
>
> if we negate both sides, we obtain an equivalent statement:
>
> (2) "Not Duplication Sensitive" <=> Idempotent
>
> Now observe that CountDistinct is "Not Duplication Sensitive function"
> example: CountDistinct (2,2) = 1.
>
> At the same time CountDistinct is "not idempotent" (see above).
>
> Therefore we have found 1 case where:
>
> (3) "Not Duplication Sensitive function" => "idempotent"
>
> is false. Therefore (2), (1) are false.

All of the above is irrelevant because the aggregate count
distinct is not related to the binary count distinct you've
supplied.

Remember this message?

http://groups.google.com/group/comp.databases.theory/msg/a639271c4355ec5d

In it, I said "If the binary form of the aggregate


function is idempotent, the aggregate will return the same value
even if values are repeated arbitrarily. Since + is not idempotent,
sum() is "sensitive" to repeated values. Since binary min *is*
idempotent, aggregate min() is not "sensitive" to repeated
values."

Until you understand the above paragraph, you will
not understand the issue you are asking about.
In particular, you need to understand how binary functions
relate to aggregates.

I will refer to your above binary count distinct as "cd()".
I will apply it to {5, 6, 2}.

Aggregating cd() into a full aggregate count distinct
would work as follows:

cd(cd(5,6),2)

This evaluates as follows:

= cd(cd(5,6),2)
= cd(2,2)
= 1

So if we try to aggregate your cd() function over {5,6,2}
it evaluates to 1. However, there are 3 distinct values in
{5, 6, 2}, so the cd() function you supplied is not a
representation of the aggregate count distinct.

Consider how you would use + to make sum:

(5+6)+2

or binary min to make aggregate min:

min(min(5,6),2)

Consider also why every binary function used to make
an aggregate is both commutative and associative.
What if it wasn't?
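The evaluation above can be replayed in a few lines; cd() below is the binary function under discussion (two distinct arguments give 2, equal arguments give 1), and the fold shows why it cannot yield the aggregate count distinct:

```python
# Folding the proposed binary "count distinct" cd() over {5, 6, 2}
# and comparing with the true number of distinct values.
from functools import reduce

def cd(x, y):
    return 1 if x == y else 2   # binary "count distinct" of two values

folded = reduce(cd, [5, 6, 2])  # cd(cd(5, 6), 2) = cd(2, 2) = 1
distinct = len({5, 6, 2})       # the aggregate actually wanted

assert folded == 1 and distinct == 3
```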

Extra credit:
avg() is one aggregate that doesn't seem to fit with
the rest. Why is that?

Good luck.


Marshall

pamela...@libero.it

Sep 18, 2006, 3:14:04 AM

Bob Badour ha scritto:

> To your credit, I cannot chastise you this time for ignoring the most
> compelling argument. The other points she made were equally absurd, and
> your replies persuasively showed them as complete nonsense.
>
> What's more, you responded to her incoherence with something coherent
> and made it look easy, which earns gobs and gobs of style points. As a
> post to persuade third parties, I have to give you two thumbs up.
>
> (I am only allowed to do that because I have two thumbs.)


With your sharp sense of logic, if you were to write a DB program it
would fail after the first 10 lines of code.

Luckily nobody allows you to do that.

-P

pamela...@libero.it

Sep 18, 2006, 3:27:25 AM

Marshall ha scritto:

> > A binary function f(x,y) is called idempotent if for all x
> >
> > f(x,x) = x
>
> Yes, this is exactly what I've been saying, and what's been
>

> You still don't understand how binary functions are used

> In it, I said "If the binary form of the aggregate
> function is idempotent, the aggregate will return the same value
> even if values are repeated arbitrarily. Since + is not idempotent,
> sum() is "sensitive" to repeated values. Since binary min *is*
> idempotent, aggregate min() is not "sensitive" to repeated
> values."
>


Marshall,

Your statement

"If the binary form of the aggregate
function is idempotent, the aggregate will return the same value
even if values are repeated arbitrarily. Since + is not idempotent,
sum() is "sensitive" to repeated values. Since binary min *is*
idempotent, aggregate min() is not "sensitive" to repeated
values."

is TRUE. I already told you 40 posts ago.

what it states is formally:

idempotent => "Not replication sensitive"

This is true. What I am stating is that

"Not replication sensitive" => "binary idempotent"

is a false statement.

In fact:

1. Count Distinct is "Not Replication Sensitive", but also
2. Count Distinct is not binary idempotent:
e.g. countDistinct (5,5) = 1 => Count Distinct is not binary idempotent

so it is NOT true that

"Not replication sensitive" => "idempotent"


Therefore

binary idempotent <=> "Not replication sensitive"

does not hold.

If it does not hold for the binary case (n=2), it does not hold in
general.

This is elementary logic. I have started from the definition.

This is not philosophy. Please use logic/math argument to respond.
Do not just say that I do not understand. That's not a logic argument.

If my argument has a flaw, please, Marshall or Bob or anyone, point out
* exactly where it is * so that I can see it.

I am not afraid to recognize it.

-P

> Marshall

David Cressey

Sep 18, 2006, 7:44:04 AM

"Marshall" <marshal...@gmail.com> wrote in message
news:1158529573.2...@e3g2000cwe.googlegroups.com...

There is an advantage to ignorance, as I pointed out to you about a year
ago. Unfortunately, the advantage is in the shedding. One can't profit
repeatedly from the same point of ignorance. If one is unwilling to learn
(which the phrase "invincibly ignorant" implies) then one learns the same
truth over and over again, but forgets it as soon as it threatens to undo
one's ignorance.

You already pointed out that the lowest table in Pam's example is not in
1NF. The same is true of the two lowest tables in the original example:
TRANS_A and TRANS_B. Pam's current theory is that GROUP BY produces the
wrong groups.

I haven't attempted to use GROUP BY with data that is not in 1NF, but I
would expect certain anomalies to appear. And I expect that summing an
extensive measure is among them. I also expect that counting the rows in
each group would produce anomalies. This has nothing to do with limitations
in SQL. It has to do with inability to tell duplicates apart, inherent in
data that is not in 1NF.

Marshall

Sep 18, 2006, 12:51:02 PM
David Cressey wrote:
>
> There is an advantage to ignorance, as I pointed out to you about a year
> ago. Unfortunately, the advantage is in the shedding. One can't profit
> repeatedly from the same point of ignorance. If one is unwilling to learn
> (which the phrase "invincibly ignorant" implies) then one learns the same
> truth over and over again, but forgets it as soon as it threatens to undo
> one's ignorance.

I stand corrected!


Marshall

Marshall

Sep 18, 2006, 1:11:38 PM
pamela...@libero.it wrote:
>
> This is not philosophy. Please use logic/math argument to respond.
> Do not just say that I do not understand. That's not a logic argument.
>
> If my argument has a flaw, please, Marshall or Bob or anyone, point out
> * exacly where it is * so that I can see it.

Pamela,

I already did that.

I can think of five possible explanations for your above post:

1) There is some language barrier that is preventing you from
understanding what I'm writing.

2) You are skimming over my posts and not reading them
thoroughly enough.

3) You lack sufficient intelligence to understand what I'm saying.

4) You are not willing to do the work to understand what I'm saying

5) You are a troll and are only pretending to be interested in this
question.

In none of the above circumstances is it worth my time to continue
trying to explain this to you.

On the outside chance that you actually are motivated and capable
of mastering this material, I will again say that I've already
explained all the necessary points, and you have only to read my
posts *carefully* to figure out why your count distinct argument
doesn't work.


Marshall

kvnkr...@gmail.com

Sep 18, 2006, 2:14:42 PM

The flaw is simple. Marshall was talking about the binary forms of
aggregate functions. Count Distinct (a,b) is not the binary form of
the aggregate Count Distinct (S).


> I am not afraid to recognize it.

If it's not fear, then what is it? Pride? Lack of intelligence?

>
> -P
>
> > Marshall

pamela...@libero.it

Sep 18, 2006, 6:15:30 PM

kvnkr...@gmail.com ha scritto:

> The flaw is simple. Marshall was talking about the binary forms of
> aggregate functions. Count Distinct (a,b) is not the binary form of
> the aggregate Count Distinct (S).
>


This is crap. So you mean that if I have 2 records your equivalence does
not hold, and if I have more than that it holds? Even if this were true,
we would still say that it does not hold. If the property fails for some
n, it simply does not hold.

Ah. You guys prove that you are unable to talk scientifically.

How can you talk about data management if you are missing the most
basic way to discuss formally?

Again. Please take from my proof the line(s) which you believe to be
wrong and give a concise but clear *proof* that they are. That would
be a scientific approach.

I started from the binary definition and gave precise statement (even
numbered the lines).

I have taken the time to give a math proof that there is no
equivalence between the concepts of "idempotence" and "duplication
insensitivity". I may be right or wrong, but you should do the same;
otherwise you are just a poor fool.

If you do not do that or you are not able to do that, you have just
proved what you are.

Be serious and talk math / logic , please.

-P

pamela...@libero.it

Sep 18, 2006, 7:54:32 PM

kvnkr...@gmail.com ha scritto:

> If it's not fear, then what is it? Pride? Lack of intelligence?

I repeat this for the last time
==========================

(1) Consider n=2, (n is the number of operands),
what is count distinct of { 3 , 3 } ?

(2) Answer 1. Agree ? If yes proceed.


So. Does Count Distinct satisfy this definition:

(3) A binary function f ( x , y ) is called idempotent
if for all x ,
f ( x , x ) = x ?


(4) count distinct of { 3 , 3 } = 1, which is different from 3.
It doesn't.


(5) So, for n=2 Count Distinct is NOT idempotent


(6) Now we know that Count Distinct is "Replication Insensitive"

Therefore,
(7) "replication Insensitive" => "binary idempotent"
does not hold.


(8) Therefore, the notions of idempotence and "Replication
insensitivity" are not the same thing, for n=2.

(9) We have just seen that for n=2.
This is sufficient to say that equivalence does not hold.


Tell me the precise statement which is wrong
and prove that it is wrong. Show me your intelligence.


-P

Marshall

Sep 18, 2006, 8:38:45 PM
pamela...@libero.it wrote:
>
> So. Does Count Distinct satisfy this definition:
>
> (3) A binary function f ( x , y ) is called idempotent
> if for all x ,
> f ( x , x ) = x ?

No, it does not satisfy the definition. count distinct,
as understood in SQL, is not a binary function. It
is an aggregate function. They are not the same.
An aggregate function invoked on a relation of size
two is not the same as a binary function.


> Tell me the precise statement which is wrong
> and prove that it is wrong. Show me your intelligence.

Statement 3 is wrong. It is not a question of proof,
but rather a question of meeting a definition or not.
Count distinct does not meet the definition.


Marshall

JOG

Sep 18, 2006, 9:21:32 PM
pamela...@libero.it wrote:
> kvnkr...@gmail.com ha scritto:
>
> > If it's not fear, then what is it? Pride? Lack of intelligence?
>
>
>
> I repeat this for the last time
> ==========================
>
> (1) Consider n=2, (n is the number of operands),
> what is count distinct of { 3 , 3 } ?
>
> (2) Answer 1. Agree ? If yes proceed.
>
>
> So. Does Count Distinct satisfy this definition:
>
> (3) A binary function f ( x , y ) is called idempotent
> if for all x ,
> f ( x , x ) = x ?
>
>
> (4) count distinct of { 3 , 3 } = 1, which is different from 3.
> It doesn't.

Wuh? The count_distinct that you were discussing takes a multiset as input.
What's that got to do with the binary operation you have listed above?
You certainly can't build the aggregate version of count_distinct out
of it. Consider min:

e.g.: amin( {a,b,c,d} ) = bmin(a, bmin(b, bmin(c,d)))

and note the difference between the aggregate min (amin) which takes a
multiset, and the binary form (bmin) which has two inputs - they are
totally different beasts, and I believe you are confusing them.
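Under the stated distinction, the relationship can be written out directly (bmin/amin exactly as defined above):

```python
# amin (aggregate min over a multiset) is built by folding
# bmin (binary min over exactly two inputs): two different functions.
from functools import reduce

def bmin(x, y):
    return x if x <= y else y

def amin(multiset):
    return reduce(bmin, multiset)

a, b, c, d = 7, 3, 9, 5
assert amin([a, b, c, d]) == bmin(a, bmin(b, bmin(c, d))) == 3
```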

>
>
> (5) So, for n=2 Count Distinct is NOT idempotent

Hence this is an incorrect conclusion, and further steps are therefore
incorrect too.

Bob Badour

Sep 19, 2006, 12:39:34 AM
JOG wrote:

> pamela...@libero.it wrote:
>
>>kvnkr...@gmail.com ha scritto:
>>
>>
>>>If it's not fear, then what is it? Pride? Lack of intelligence?
>>
>>
>>
>>I repeat this for the last time
>>==========================
>>
>>(1) Consider n=2, (n is the number of operands),
>> what is count distinct of { 3 , 3 } ?

It is the same as COUNT{3}


>>(2) Answer 1. Agree ? If yes proceed.
>>
>> So. Does Count Distinct satisfy this definition:
>>
>>(3) A binary function f ( x , y ) is called idempotent
>> if for all x ,
>> f ( x , x ) = x ?

Count distinct is a composition of two functions. Distinct projects onto
a set of attributes. Count distinct then counts the result of the
project. Count is not idempotent, but in this case it does not operate
on any duplication. (Replication means something different in the
database field.)


>>(4) count distinct of { 3 , 3 } = 1, which is different from 3.
>> It doesn't.

Using set notation: { 3, 3 } = { 3 } = { 3, 3, 3} = { 3, ... 3, 3}
The count (cardinality) of the above set is 1 regardless how many times
we repeat 3 in its representation.
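The same observation in code: a set collapses repeats, so its cardinality ignores them.

```python
# A set collapses repeated elements, so counting the distinct values
# gives 1 no matter how many times 3 is written in the representation.
assert {3, 3} == {3} == {3, 3, 3}
assert len({3, 3, 3}) == 1
```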

pamela...@libero.it

Sep 19, 2006, 3:17:16 AM

Bob Badour ha scritto:

> Count is not idempotent, but in this case it does not operate
> on any duplication. (Replication means something different in the
> database field.)
>

I also agree (can't believe I agree with Bob :) that, for terminology,
it's better to avoid the word "replication" in favor of "duplication"
(if you see a better term, please **point it out**).

If idempotent is not the term, what is a good term?

quoted from:
http://groups.google.it/group/sci.math/browse_frm/thread/17781c0124345f6e?scoring=d&hl=it

William Hughes

>More important perhaps, CountDistinct(2,2) = 1


>It is clear that "replication insensitive" cannot be the same as
>idempotent.
>For one thing, any constant function that can be replication
>insensitive
>(the definition given is not complete) is. No idempotent function can
>be constant.

http://groups.google.it/group/comp.databases.theory/browse_frm/thread/538735631f141027?hl=it

Marshall

>cd(cd(5,6),2)
...


>Consider how you would use + to make sum:
>(5+6)+2
>or binary min to make aggregate min:
>min(min(5,6),2)

>Extra credit:
>avg() is one aggregate that doesn't seem to fit with
>the rest. Why is that?

Now let's proceed further. When you say the above, 3 questions:

1.
Let me understand: are you implying that you could
compute the aggregate function recursively?


2.
Would you please explain how you compute the
following "aggregate" functions for n=3:

- range of values (difference between max and min)
- median
- standard deviation


3.
Would you please provide the "binary" (n=2)
versions of the above listed functions.



-P
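For what it's worth, questions 2-3 have a standard answer once the accumulator is allowed to carry extra state: range (like avg in Marshall's extra credit) folds if the intermediate result is a pair rather than a bare value, while median does not fold at all without keeping every value. A sketch, not from the thread; the function names are mine:

```python
# Range and average fold incrementally when the accumulator carries
# extra state; median has no fixed-size accumulator and needs all
# values. (stddev similarly folds with (sum, sum_of_squares, count).)
from functools import reduce

def range_step(acc, x):              # accumulator is the pair (min, max)
    lo, hi = acc
    return (min(lo, x), max(hi, x))

def value_range(values):
    lo, hi = reduce(range_step, values, (values[0], values[0]))
    return hi - lo

def avg_step(acc, x):                # accumulator is the pair (sum, count)
    s, n = acc
    return (s + x, n + 1)

def average(values):
    s, n = reduce(avg_step, values, (0, 0))
    return s / n

assert value_range([5, 6, 2]) == 4
assert average([5, 6, 2]) == 13 / 3
```

Note these step functions take an accumulator and a value, not two values, so they fall outside the binary f(x,y) form discussed above.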

Marshall

Sep 19, 2006, 4:04:08 AM
pamela...@libero.it wrote:
>
> Marshall
> [...]

> 3 questions:
>
> Now let's proceed further
> when you say:

Pamela,

I'm sorry but I've already put a fair bit of time into helping
you and you've responded with insults and innuendo, and
haven't given any indication that you've done any more
than simply skim my responses. If you're not willing
to carefully read what I write, why should I write still more?

If you want to discuss this further, I would be happy to
proceed if you can demonstrate that you understand the
difference between a binary function and an aggregate
function, and that you have figured out how they are
related. It's been posted about five times in the
last 24 hours.


Marshall

pamela...@libero.it

Sep 19, 2006, 4:37:43 AM

Marshall ha scritto:

> I would be happy to
> proceed if you can demonstrate that you understand the

"Real" Mathematicians have told you that you are wrong
and you do not admit it.

Anyway, I understand you more than you think.
I have even understood your "perspective".

You are actually reasoning within a subset
of aggregate functions. Those which can be expressed,
say, through a "binary generator".

Even in that case William Hughes has suggested
that equivalence is not proven.

And in any case we have not seen any proof from you.

I yield my title.

-P

> Marshall
