How to normalize this?

hugoko...@gmail.com

unread,

Feb 8, 2013, 3:33:27 PM2/8/13

to

Hi all,

I've been thinking about a scenario for a few days now, and I still am not sure how this should be normalized. Maybe someone here can help?

The first part is easy. I have five attributes (a, b, c, d, e) and the following FDs:
1. {a, b, c} --> d
2. {a, b, c} --> e
Obviously, just a single relation with all five attributes and (a, b, c) as candidate key.

But then I find that the second FD isn't full. It should be replaced with:
2a. {a, b} --> e
2b. {b, c} --> e

The single relation now obviously violates 2NF. But how to replace it? The only option I see is to remove e from the first relation, and add a relation (a, b, e), with candidate key (a, b). But that does not represent FD 2b, and it allows extensions that would violate it. Nothing I could not fix with additional code in the RDBMS, but I don't like using code to prevent FD violations.

Does anyone know a better way to normalize this set of FDs?

Cheers,
Hugo

derek.a...@gmail.com

unread,

Feb 8, 2013, 7:39:40 PM2/8/13

to

On Saturday, 9 February 2013 07:33:27 UTC+11, hugoko...@gmail.com wrote:
>
> Does anyone know a better way to normalize this set of FDs?

Of course. The solution is simple.

I don't know why I bother, but I will give you one more chance. I am happy to explain things to someone who is genuinely seeking undestanding; I am not happy to argue with someone who is fixed in their understanding.

In order to ensure that, in consideration of your behaviour in the past, I will lay down some ground rules for your interaction with me. If you do any of the following, I will stop reading your posts:
• lie
• contradict yourself or behave hypocritically (doing exactly what you have railed against)
• repeat the wrong thing, after what you have written has been identified as the wrong thing
• demonstrate an inability to think logically.
• demonstrate that you are not genuinely trying to learn (the question posed implies that (a) you do not know and (b) you want to learn, but your past behaviour demonstrated an intention that you were not genuine, you were evidently here for the engagement, and fixed on your beliefs). if you are unteachable, the exercise is a waste of time.
(Remember, you posted the question, which means you do not know the answer.)
• Respond to my explanatory posts in less than 24 hours. (I spend a lot of time putting them together. You need to treat that with respect, and think about them carefully, re-read them several times if necessary, before responding. That will indicate that you are genuinely learning. Knee-jerk reactions will demonstrate that you are doing so from your existing set of beliefs and understandings (which by definition means that you will remain in your present state of lack of knowledge; that you are not open to learning anything).

I need a one-line response from you that explicitly accepts or rejects this. If you accept, I will answer your question.

Cheers
Derek

hugoko...@gmail.com

unread,

Feb 8, 2013, 7:56:58 PM2/8/13

to

Op zaterdag 9 februari 2013 01:39:40 UTC+1 schreef derek.a...@gmail.com het volgende:

> On Saturday, 9 February 2013 07:33:27 UTC+11, hugoko...@gmail.com wrote:
>
> >
>
> > Does anyone know a better way to normalize this set of FDs?
>
>
>
> Of course. The solution is simple.

Great. So post it. Given the amount of attributes and FDs, just posting it would probably have been less effort than writing the message you did.

Or don't post it, if you relish claiming to be far superior without bothering to prove it. But don't be surprised if I stop taking you seriously if all you do is claim to know something and then fail to deliver.

> I don't know why I bother, but I will give you one more chance. I am happy to explain things to someone who is genuinely seeking undestanding; I am not happy to argue with someone who is fixed in their understanding.
>
>
>
> In order to ensure that, in consideration of your behaviour in the past, I will lay down some ground rules for your interaction with me. If you do any of the following, I will stop reading your posts:
>
> • lie
>
> • contradict yourself or behave hypocritically (doing exactly what you have railed against)
>
> • repeat the wrong thing, after what you have written has been identified as the wrong thing
>
> • demonstrate an inability to think logically.

I have never knowingly done any of the above. Nor do I plan to.

> • demonstrate that you are not genuinely trying to learn (the question posed implies that (a) you do not know and (b) you want to learn, but your past behaviour demonstrated an intention that you were not genuine, you were evidently here for the engagement, and fixed on your beliefs). if you are unteachable, the exercise is a waste of time.
>
> (Remember, you posted the question, which means you do not know the answer.)

I posted the question, and the best answer I could find myself. I have considered several other answers before, then carefully tried to find flaws in them. And rejected them. I don't like my final answer either, which is why I posted.
But I *will* give any answer I get here the same treatment I gave to my own attempts - try to shoot holes in them as hard as I can. If I can't find any problems, I'll thank you and I'll have learned. If I do see problems, I'll tell you, and we'll both have learned.

Me asking does not mean I should take replies I get, from you or from anyone else, as gospel. If that's what you want me to do, then please don't reply.

> • Respond to my explanatory posts in less than 24 hours. (I spend a lot of time putting them together. You need to treat that with respect, and think about them carefully, re-read them several times if necessary, before responding. That will indicate that you are genuinely learning. Knee-jerk reactions will demonstrate that you are doing so from your existing set of beliefs and understandings (which by definition means that you will remain in your present state of lack of knowledge; that you are not open to learning anything).

As I told you in my email, I do not get notifications of new posts in this topic, and I stopped regularly visiting usenet groups a long time ago. If you had sent me a heads up that you responded (as you did with your earlier response), I would have seen it sooner.

> I need a one-line response from you that explicitly accepts or rejects this. If you accept, I will answer your question.

Since your post is a list of conditions that will make you stop reading my posts, there is nothing for me to accept or reject. I have no way to force you to read or not read my posts, that's entirely your call.

But I would very much appreciate if someone could post a reply to my question here. And if you are that someone, than I also suggest that we keep the issues we have in that other topic seperate from this topic.

Cheers,
Hugo

derek.a...@gmail.com

unread,

Feb 8, 2013, 8:42:12 PM2/8/13

to

On Saturday, 9 February 2013 11:56:58 UTC+11, hugoko...@gmail.com wrote:
>
> Op zaterdag 9 februari 2013 01:39:40 UTC+1 schreef derek.a...@gmail.com het volgende:
>

> > Of course. The solution is simple.

Oh, I failed to mention the bleeding obvious: the solution is simple, but the explanation is not.

It appears you understand that:

> I posted the question, and the best answer I could find myself. I have considered several other answers before, then carefully tried to find flaws in them. And rejected them. I don't like my final answer either, which is why I posted.

> Or don't post it, if you relish claiming to be far superior without bothering to prove it. But don't be surprised if I stop taking you seriously if all you do is claim to know something and then fail to deliver.

You can't prove or disprove anything to a mental patient. Your attacks and insinuations of same have no credibility. Call me a murderer, it has no effect, I do not have the need to prove otherwise. I won't be baited or tricked into responding and engaging without resolution.

> I have never knowingly done any of the above. Nor do I plan to.

Ok, so you are not only ignorant, but you are ignorant that you are ignorant.

To understand the relevance of this, find and read a great paper by Justin Kruger and David Dunning named "Unskilled and Unaware". The people like you screamed in agony when the read it, and people like Fagin and Date mounted insane "logic" attacks, so they wrote a second paper that shut the lunatics up. Just read the first one for now.

> But I *will* give any answer I get here the same treatment I gave to my own attempts - try to shoot holes in them as hard as I can.

Ok, so you are proving that you will be argumentative and unteachable.

> If I can't find any problems, I'll thank you and I'll have learned. If I do see problems, I'll tell you, and we'll both have learned.

Completely in-credible.

> Me asking does not mean I should take replies I get, from you or from anyone else, as gospel. If that's what you want me to do, then please don't reply.

See, there is the madman exposing himself. Idiotic references to extremes which are out of scope.

> As I told you in my email, I do not get notifications of new posts in this topic, and I stopped regularly visiting usenet groups a long time ago. If you had sent me a heads up that you responded (as you did with your earlier response), I would have seen it sooner.

Another incredibility, given the fact that, despite what you say, you took just 18 minutes to notice my post *and* to respond to it.

> ... I also suggest that we keep the issues we have in that other topic seperate from this topic.

I am not schizophrenic. I am not programmed to tolerate insane contradictions. If you are a bear in that room, you will be a bear in this room. Which is why I suggested boundaries in this room. Your post merely proves that you are very attached to being a bear; you cannot give it up and take a learning position; and you reject my terms.

> Since your post is a list of conditions that will make you stop reading my posts, there is nothing for me to accept or reject.

You do not have the courtesy to accept/reject the stated terms like a gentleman. No, you have to post evidence of insane thinking, and reject it on your terms. Which is fine, and it is a rejection of my terms, no matter what you call it, no matter how much you dance around it, no matter if you re-define the terms as non-terms. (I notice you have your own private definitions for many things.

> But I would very much appreciate if someone could post a reply to my question here.

Sure, enjoy your theoretical discussions.

Goodbye
Derek

hugoko...@gmail.com

unread,

Feb 9, 2013, 5:27:42 AM2/9/13

to

Op zaterdag 9 februari 2013 02:42:12 UTC+1 schreef derek.a...@gmail.com het volgende:

(snipped ad-hom)

> Another incredibility, given the fact that, despite what you say, you took just 18 minutes to notice my post *and* to respond to it.

Check the timestamps on my posts. I came here to post this question. After posting, I noticed new posts in the other topic, read them, and started working on my reply. Which took me a few hours - you put a lot of effort in that post, and it warranted a similar effort on my part. Then, when I posted that reply, I saw your new post here and responded.
Not exactly rocket science.

> Which is fine, and it is a rejection of my terms, no matter what you call it, no matter how much you dance around it, no matter if you re-define the terms as non-terms. (I notice you have your own private definitions for many things.

Your terms include "you must not lie". But all your posts demonstrate that you apply the term "lie to any arguments you can't refute, and that you insult whoever posts evidence that you are wrong. I do not lie by any generally accepted dictionary definition of the word "lie". By your personal, completely different definition, I apparently do lie.

Your terms include "you must guarantee to reply to my posts within 24 hours". But you yourself are, apparently, from to take much longer to respond to my posts. (Evidence is in the other topic, wherre I posted on Feb 5 and you responded on Feb 7). Double standards.

Your terms include "you must guarantee to reply within 24 hours", and "you must take the time to carefully read my posts, several times, and ponder them, before replying". Given timezone differences, family obligations, jobs, etc, posting *any* reply within 24 hours is hard; posting a well thought-out reply after carefully pondering your words is impossible to guarantee.

I could go on, but I think the point is clear. Your stated terms are not designed to guarantee a valuable conversation, they are designed to lure me in a trap. Given the number of times you have already accused me, completely unfounded, of lying, being illogical, being argumentative, contradicting myself, etc, I have no reason to believe you'll stop doing so after I agree to these terms. I have no intention of giving you more ammunition to drag the quality of discussion down to the level of mudslinging contests when you are already perfectly capable of doing so without it.

You obviously had no intention of sharing the answer to my question, if you even know it at all. You simply could not resist the urge to continue your endless stream of unfounded ad-hominem, and to boost your self-esteem by publicly claiming to know something I don't.

I am still interested in getting an answer to my original question in this topic, from whoever is prepared to help me out.
I am not interested in your opinions about my mental health, my idiocy, my argumentativeness, or whatever other nonsense you may come up with.

Norbert_Paul

unread,

Feb 9, 2013, 6:50:46 AM2/9/13

to

hugoko...@gmail.com wrote:
> Does anyone know a better way to normalize this set of FDs?

There is an algorithm thad can do this for you:
Google the terms "synthesis algorithm third normal form".

Norbert

hugoko...@gmail.com

unread,

Feb 10, 2013, 8:14:42 AM2/10/13

to

Op zaterdag 9 februari 2013 12:50:46 UTC+1 schreef Norbert_Paul het volgende:

> There is an algorithm thad can do this for you:
>
> Google the terms "synthesis algorithm third normal form".

Hi Norbert,

Thanks for the pointer. Very useful!

I used the algorithm on my set of FDs and this resulted in three relations:

R1: ABCD (ABC --> D)
R2: ABE (AB --> E)
R3: BCE (BC --> E)

Interestingly, that was the first thing I came up with when thinking about this problem. I thought it would be incorrect because
(a) is has an update anomaly (changing an E value in R2 without making the accompanying change in R3 would result in inconsistent data); and
(b) R3 is redundant, it can be reconstructed by joining R1 and R2.

I am not really surprised at (b), because this does not follow from just the FDs, but from a combination of the FDs and the original design, a single relation ABCDE, which implies that there will be no (AB) or (BC) value combinations in R2 or R3 that are not in R1. That fact is not an input in Bernsteins algorithm, so it makes sense that all three relations are produced.

I am surprised at (a). I was under the impression that a fully normalized is free of modification anomalies. I checked the result of Bernsteins algorithm against the higher normal forms, but found no violation.

So either:
(i) I made a mistake when applying Bernsteins algorithm (improbable, it's very straightforward and I had only three FDs to begin with), or
(ii) I overlooked a violation of one of the higher normal forms (possible, I learned normalization from one of the idiots who think that 3NF is enough an don't bother with the higer forms, so I had to study that later, without help), or
(iii) I was misinformed about full normalization eliminating ALL modification anomalies, and found a case where it doesn't. And where it's maybe not even possible to achieve a relational design without modification anomalies.

Do you see any flaws in my reasoning?

Cheers,
Hugo

Eric

unread,

Feb 10, 2013, 9:45:16 AM2/10/13

to

I think (i) applies, the mistake being not giving the algorithm all the
information. Given _only_ the 3 FDs and nothing else, is updating

(a1, b1, c1, d1)
(a1, b1, e1)
(b1, c1, e1)

to

(a1, b1, c1, d1)
(a1, b1, e1)
(b1, c1, e2)

producing inconsistent data? I can not see any reason why it would be,
except your "fact" about the original design, which did not feed into
the algorithm. Without that "fact" you have neither an update anomaly
nor a redundancy. If you made the correct decision about the original
FD being wrong then your "fact" is not a fact at all.

The other possibility is that the new set of FDs is still wrong, and
there is other information that that has not come to light. You may have
to go back to the real world and make sure you have the complete set of
FDs.

Eric
--
ms fnd in a lbry

hugoko...@gmail.com

unread,

Feb 10, 2013, 11:24:40 AM2/10/13

to

Op zondag 10 februari 2013 15:45:16 UTC+1 schreef Eric het volgende:

> I think (i) applies, the mistake being not giving the algorithm all the
>
> information. Given _only_ the 3 FDs and nothing else, is updating
>
>
>
> (a1, b1, c1, d1)
>
> (a1, b1, e1)
>
> (b1, c1, e1)
>
>
>
> to
>
>
>
> (a1, b1, c1, d1)
>
> (a1, b1, e1)
>
> (b1, c1, e2)
>
>
>
> producing inconsistent data? I can not see any reason why it would be,
>
> except your "fact" about the original design, which did not feed into
>
> the algorithm. Without that "fact" you have neither an update anomaly
>
> nor a redundancy. If you made the correct decision about the original
>
> FD being wrong then your "fact" is not a fact at all.

Hi Eric,

Yes, you are right. With only the given FDs, that would not be an inconsistency. I had overlooked that in my previous reply; it is only an inconsistency in the original problem (which gave the information that (ABCDE) was a good solution until I found the stricter FDs.

> The other possibility is that the new set of FDs is still wrong, and
>
> there is other information that that has not come to light. You may have
>
> to go back to the real world and make sure you have the complete set of
>
> FDs.

Unfortunately, there is no real world problem to go back to. This problem stems from my thinking about normalization and trying to understand how the rules behave with edge cases.

However, I did at one point start to doubt if, perhaps, I had formulated a set of FDs that could impossible exist in the real world, so I then set myself the task of trying to find a (more or less) realistic scenario with this set of FDs. I managed to find something that, while probably not realistic enough to qualify as "real world" (so please don't comment on that), is believable enough to qualify as a classroom exercise.

A charity is organizing a big fundraiser event. Several rock bands and several opera singers will participate. Throughout the country, in different locations, there will be performances where one rock band and one opera singer will sing one song together. This event will take place over a month. On any given date, there may be performances in different locations. At any given locations, there may be performances at different dates.
A band can play more than one song on a given date, but not with the same opera singer. They might have another performance with the same singer at another date, either at the same or another location, and either with the same or another song. If a band does play with multiple opera singers at the same date, it may be the same or a different song.
An opera singer can play more than one song on a given date, but not with the same band. They might have another performance with the same band at another date, either at the same or another location, and either with the same or another song. If an opera singe does play with multiple bands at the same date, it may be the same or a different song.

Functional dependencies so far: (Band, Singer, Date) --> Location and (Band, Singer, Date) --> Song.

The additional requirement is: If a band plays multiple songs on the same date, they all have to be in the same location. And the same restriction goes for opera singers. Though technically possible to have a performing artist perform in New York and London on the same day, the organization lacks the travel budget for it. So now we have (Band, Date) --> Location and (Singer, Date) --> Location, both superseding the original (Band, Singer, Date) --> Location.

In this example scenario, if Placido Domingo and Fun are scheduled to play "Imagine" in London on March 15, 2013, and I change the location for Placido Domingo to New York, I have to change it for Fun too - otherwise they won't be performing together!

In the original, single-table version, this is not a problem. It's one single update. But that violates 2NF.
After solving that, we now have this issue. I can formulate the requirement that would be violated in terms of the example (if a band and singer perform on the same date, they must be in the same location on that date). Or in abstract terms for the original problem (for any (a1, b1, c1) in R1, there must be a (a1, b1, e1) in R2 and a (b1, c1, e2) in R3, such that e1 = e2 - forgive me if some of the terminology is wrong; I think about this in Dutch and have to translate). But I am not able to transform this requirement into the form of a regular functional dependency.

Cheers,
Hugo

Norbert_Paul

unread,

Feb 11, 2013, 4:09:55 AM2/11/13

to

hugoko...@gmail.com wrote:
> Op zaterdag 9 februari 2013 12:50:46 UTC+1 schreef Norbert_Paul het volgende:
>> There is an algorithm thad can do this for you:
>>
>> Google the terms "synthesis algorithm third normal form".
>
> Hi Norbert,
>
> Thanks for the pointer. Very useful!

You are welcome.

> I used the algorithm on my set of FDs and this resulted in three relations:
>
> R1: ABCD (ABC --> D)
> R2: ABE (AB --> E)
> R3: BCE (BC --> E)
>
> Interestingly, that was the first thing I came up with when thinking about this problem. I thought it would be incorrect because
> (a) is has an update anomaly (changing an E value in R2 without making the accompanying change in R3 would result in inconsistent data); and
> (b) R3 is redundant, it can be reconstructed by joining R1 and R2.
>
> I am not really surprised at (b), because this does not follow from just the FDs, but from a combination of the FDs and the original design, a single relation ABCDE, which implies that there will be no (AB) or (BC) value combinations in R2 or R3 that are not in R1. That fact is not an input in Bernsteins algorithm, so it makes sense that all three relations are produced.
>
> I am surprised at (a). I was under the impression that a fully normalized is free of modification anomalies. I checked the result of Bernsteins algorithm against the higher normal forms, but found no violation.
>
> So either:
> (i) I made a mistake when applying Bernsteins algorithm (improbable, it's very straightforward and I had only three FDs to begin with), or
> (ii) I overlooked a violation of one of the higher normal forms (possible, I learned normalization from one of the idiots who think that 3NF is enough an don't bother with the higer forms, so I had to study that later, without help), or
> (iii) I was misinformed about full normalization eliminating ALL modification anomalies, and found a case where it doesn't. And where it's maybe not even possible to achieve a relational design without modification anomalies.
>
> Do you see any flaws in my reasoning?

I must admit that I didn't have a closer look at your FDs yet.
So still without going deeper into them I guess that (ii) holds
because Bernsteins algorithm only guarantees 3NF and does not
guarantee BCNF, for example.
Try to avoid words like "idiot". It is often possible to write
statements in a friendly manner that make the reader think these
words.

Cheers
Norbert

hugoko...@gmail.com

unread,

Feb 11, 2013, 11:20:44 AM2/11/13

to

Op maandag 11 februari 2013 10:09:55 UTC+1 schreef Norbert_Paul het volgende:

> I must admit that I didn't have a closer look at your FDs yet.
>
> So still without going deeper into them I guess that (ii) holds
>
> because Bernsteins algorithm only guarantees 3NF and does not
>
> guarantee BCNF, for example.

Yeah, that's what I thought too. But I just don't see which NF is violated, nor how to resolve it.

> Try to avoid words like "idiot". It is often possible to write
>
> statements in a friendly manner that make the reader think these
>
> words.

You are right. I normally try to avoid harsh language. I reread my message and see that one slipped in this time. My apologies for that, and thanks for the friendly pointer.

Cheers,
Hugo

Eric

unread,

Feb 11, 2013, 4:36:11 PM2/11/13

to

On 2013-02-10, hugoko...@gmail.com <hugoko...@gmail.com> wrote:

> Op zondag 10 februari 2013 15:45:16 UTC+1 schreef Eric het volgende:

...

The scenario is plausible, but I don't feel any more comfortable with it
than I did with the abstract example. I think (Band, Date) --> Location
and (Singer, Date) --> Location are correct (they correspond to
performer itineraries), but all I can get to after that is

(Band, Singer, Date, Location) --> Song

Take any single thing out of the left-hand side and either you can not
identify the Song, or no Song is possible.

I'm afraid I do not really know where to go from there.

James K. Lowden

unread,

Feb 11, 2013, 11:58:10 PM2/11/13

to

On Mon, 11 Feb 2013 21:36:11 +0000
Eric <er...@deptj.eu> wrote:

> > I can formulate the requirement that would be violated in terms of
> > the example (if a band and singer perform on the same date, they
> > must be in the same location on that date).
>

> The scenario is plausible, but I don't feel any more comfortable with
> it than I did with the abstract example. I think (Band, Date) -->
> Location and (Singer, Date) --> Location are correct (they correspond
> to performer itineraries), but all I can get to after that is
>
> (Band, Singer, Date, Location) --> Song

{Band, Singer, Date} -> Location
{Band, Singer, Date, Song}

because "If a band plays multiple songs on the same date, they all have

to be in the same location."

That leaves

1. (Band, Date) --> Location
2. (Singer, Date) --> Location
3. (Band, Singer, Date, Song)

with the constraints that

4. every {Band, Date} in (3) appears in (1)
5. every {Singer, Date} in (3) appears in (2)
6. every {Band, Singer, Date} in (3) appears in the
join of {(1),(2)} on {Date, Location}

It is perfectly possible that

7. there are dates for which a band or singer is not available (has no
location, per CWA), or
8. a band/singer is at a location but won't sing (no song)

> > ... if ... I change the location for Placido Domingo to New York,

> > I have to change it for Fun too - otherwise they won't be
> > performing together!

Yes. Why does that bother you? Two facts must change -- place and
performance -- else there is no song.

Codd referred to "time-varying relations", and this is illustrative.
Let us say on Tuesday Placido's agent agrees on the New York location,
but Fun has not. Update (2), delete from (3) because, as of
Tuesday, that's all you've got. Wednesday Fun's agent calls to commit
to New York. Update (1), insert into (3).

> > In the original, single-table version, this is not a problem. It's
> > one single update. But that violates 2NF.

It violates 2NF or not, depending on what you're keeping track of. If
you're only tracking performances (not bands and singers), the
relationship

{Band, Singer, Date, Song} --> {Location}

suffices, provided these are unique:

{Singer, Date, Song}
{Band, Date, Song}
{Band, Singer, Location, Song}

to prevent a singer or band from appearing in two places at once, and
to prevent a singer from singing with two bands on the same night.

HTH.

--jkl

hugoko...@gmail.com

unread,

Feb 12, 2013, 5:09:22 AM2/12/13

to

James, Eric - thanks for your input on this.

You've given me lots of good input. I'll probably think about the scenario some more over the days to come. I can't say I really *like* any of the solutions, but sometimes you just have to accept that it doesn't get any better - and this may well be such a case.

Thanks!

Cheers,
Hugo

Derek Asirvadem

unread,

Feb 12, 2013, 10:01:14 AM2/12/13

to

James

On Tuesday, 12 February 2013 15:58:10 UTC+11, James K. Lowden wrote:

*** These minor points do not subtract from the value of your post, which is good. ***

> Codd referred to "time-varying relations", and this is illustrative.

That is not correct.

Codd's "time-varying relations" are specifically about consistency. Your post is illustrative enough without making the reference. Your answer is about Constraints, and sensible ones, not "time-varying relations".

Second, I would like to acknowledge, you are applying more Normalisation (without explaining it), and less FD resolution to the problem. Which of course elevates it from any solution or argument that is restricted to resolution of FDs.

Cheers
Derek

James K. Lowden

unread,

Feb 13, 2013, 7:57:08 PM2/13/13

to

On Tue, 12 Feb 2013 07:01:14 -0800 (PST)
Derek Asirvadem <derek.a...@gmail.com> wrote:

> *** These minor points do not subtract from the value of your post,
> which is good. ***

Thank you.

> Codd's "time-varying relations" are specifically about consistency.

I'm not sure why you say that? I was thinking of, for example,

"The totality of data in a data bank may be viewed as a
collection of time-varying relations. These relations are of assorted
degrees. As time progresses, each n-ary relation may be subject to
insertion of additional n-tuples, deletion of existing ones, and
alteration of components of any of its existing n-tuples."

I know I don't have to tell you where that comes from, but for those
following along at home, it's from "A Relational Model of Data for
Large Shared Data Banks", section 1.3. "A Relational View of Data".

--jkl

Derek Asirvadem

unread,

Feb 14, 2013, 4:47:27 PM2/14/13

to

On Thursday, 14 February 2013 11:57:08 UTC+11, James K. Lowden wrote:

>
> On Tue, 12 Feb 2013 07:01:14 -0800 (PST) Derek Asirvadem wrote:
>
> > Codd's "time-varying relations" are specifically about consistency.
>
> I'm not sure why you say that? I was thinking of, for example,
>

> < one sentence from the RM snipped >

Yeah, sure. Codd used the term five times in that paper.

There is no point quoting one, or the other, or all five, sentences in which he used the term. Doing that would be considering the sentences in isolation. The first two uses are introductions of the concept. The next three uses, in chapter 2 Redundancy & Consistency, are specifically about Database Consistency. Since I specifically mention Consistency, obviously I am referring to the Consistency in chapter 2, and not to the term as a general concept. Breaking up the paper is probably not a good thing. Understanding the whole, and specifically what he meant by "time-varying relations" in chapter 2, may be a good thing to do.

I did not state an isolated sentence:

> > Codd's "time-varying relations" are specifically about consistency.

I stated:

> > Codd's "time-varying relations" are specifically about consistency. Your post is illustrative enough without making the reference. Your answer is about Constraints, and sensible ones, not "time-varying relations".

Rather than "why do you say that?", perhaps it is up to you, since you made the reference, and it is being questioned, to identify how your example is related to Codd's "time-varying relations" and Consistency requirements thereof. If you were using the term in a general sense, which now appears to be the case, given that you have quoted the first introduction of it, and not in the particular sense that Codd discussed it in detail (Database Consistency), then just say so.

> ... for those following along at home ...

I don't think anyone cares about the skirmishes you are having with me about minor issues.

Cheers
Derek

James K. Lowden

unread,

Feb 14, 2013, 7:23:09 PM2/14/13

to

On Thu, 14 Feb 2013 13:47:27 -0800 (PST)
Derek Asirvadem <derek.a...@gmail.com> wrote:

> There is no point quoting one, or the other, or all five, sentences
> in which he used the term. Doing that would be considering the
> sentences in isolation.

The simple fact is that any sentence *can* be considered isolation.
That is how formal logic proceeds, sentence by sentence.

> Codd's "time-varying relations" are specifically about consistency.

That sentence is false, because the term "time-varying relations"
refers to concepts other than consistency.

Happy to be of service.

--jkl

Derek Asirvadem

unread,

Feb 15, 2013, 3:20:20 AM2/15/13

to

On Friday, 15 February 2013 11:23:09 UTC+11, James K. Lowden wrote:

>
> On Thu, 14 Feb 2013 13:47:27 -0800 (PST) Derek Asirvadem wrote:
>
> > There is no point quoting one, or the other, or all five, sentences
> > in which he used the term. Doing that would be considering the
> > sentences in isolation.
>
> The simple fact is that any sentence *can* be considered isolation.
> That is how formal logic proceeds, sentence by sentence.

Of course.

That is great when you need to apply stepwise logic to a problem. But humans, and databases these days are not limited to binary stepwise progression (evidently you are). There is context; there is the fact that the post at which you hijacked this thread is an implementation, not a test of abstract logic; therefore the attempt to apply it to that concrete context and set of facts is, well, not logical. I do not have the time to define logic to you.

You are absolutely stuck in it, granted. I can own imagine the problems you have approached the real world with a stepwise, abstract mindset.

The reason it is dishonest is, the original reference (yours) was made about Eric's post; which I commented about; and now you have taken the single sentence out, and presented it in isolation. That may be "logic" to you, and the way "logic" was taught in class, but that simply isn't logic to humans who have not lost the context. Certainly lawyers try that in court, but they do not get away with it.

Now you *could* just post your opinion, without reference to mine, but you don't. No, you isolate one sentence and attack it (not me the person, but my sentence). Which leads to long posts trying to explain something across the canyon. I did try, but now I am giving up.

> > Codd's "time-varying relations" are specifically about consistency.
>
> That sentence is false, because the term "time-varying relations"
> refers to concepts other than consistency.

See, there is another example of the same thing. Only a complete nutcase needs to be told that I did not state that; that that (posing the opposite) does not make the statement false; that I stated something specific to the context of that post, and nothing else. But you argue the nothing else. Sickening.

If I don't answer your posts, it means I have not opened them. Too much demand of my time, to step through the mud (your "logic", given the absence of the human variety), and we never get to the good stuff.

Goodbye
Derek

Jan Hidders

unread,

Mar 6, 2013, 9:52:40 AM3/6/13

to

On Sunday, 10 February 2013 14:14:42 UTC+1, hugoko...@gmail.com wrote:
> Op zaterdag 9 februari 2013 12:50:46 UTC+1 schreef Norbert_Paul het volgende:
>
> > There is an algorithm thad can do this for you:
>
> >
>
> > Google the terms "synthesis algorithm third normal form".
>
>
>
> Hi Norbert,
>
>
>
> Thanks for the pointer. Very useful!
>
>
>
> I used the algorithm on my set of FDs and this resulted in three relations:
>
>
>
> R1: ABCD (ABC --> D)
>
> R2: ABE (AB --> E)
>
> R3: BCE (BC --> E)
>
>
>
> Interestingly, that was the first thing I came up with when thinking about this problem. I thought it would be incorrect because
>
> (a) is has an update anomaly (changing an E value in R2 without making the accompanying change in R3 would result in inconsistent data); and
>
> (b) R3 is redundant, it can be reconstructed by joining R1 and R2.

Hello Hugo,

This is a very nice point, and one that is AFAIK indeed overlooked by many if not all textbooks. But it is also somewhat academic. In some sense you are adding an extra constraint to your database schema that could be formulated as:

"If I take the natural joins of all my relations R1, ..., Rn, and then project the result on a header Hi of Ri, I obtain again Ri."

or equivalently:

"all my relations in my database schema are projections of what was once a single relation"

This holds of course trivially if you have only one relation. Under this constraint your observations are right. But it is usually assumed that this constraint does not really apply and that it is an artifact of the schema that you (perhaps mistakenly) chose as your starting point. So the fact that the normalized schema is in some sense more liberal, because it does not have that constraint, is usually considered a feature and not a bug.

-- Jan Hidders

Norbert_Paul

unread,

Mar 7, 2013, 1:21:38 AM3/7/13

to

Hi Jan,

Are you the Jan Hidders that has written this paper:
"An Isotopic Invariant for Planar Drawings of Connected Planar Graphs (1995)"?

Norbert

Jan Hidders wrote:
> -- Jan Hidders

Jan Hidders

unread,

Mar 7, 2013, 2:44:24 AM3/7/13

to

Op donderdag 7 maart 2013 07:21:38 UTC+1 schreef Norbert_Paul het volgende:

> Hi Jan,
>
>
>
> Are you the Jan Hidders that has written this paper:
>
> "An Isotopic Invariant for Planar Drawings of Connected Planar Graphs (1995)"?

Hi Norbert,

Yes, it's been a while, but that was definitely me. `;-)

How on earth did you bump into that paper?

-- Jan Hidders

Norbert_Paul

unread,

Mar 7, 2013, 3:08:51 AM3/7/13

to

Jan Hidders wrote:
> Op donderdag 7 maart 2013 07:21:38 UTC+1 schreef Norbert_Paul het volgende:
>> Hi Jan,

> How on earth did you bump into that paper?
>
> -- Jan Hidders

That was at the start of my scientific career. I made the nD generalization
of PLV about 10 years ago. Remember?
http://digbib.ubka.uni-karlsruhe.de/volltexte/3882003

Norbert

Norbert_Paul

unread,

Mar 7, 2013, 3:16:05 AM3/7/13

to

Jan Hidders wrote:
> Op donderdag 7 maart 2013 07:21:38 UTC+1 schreef Norbert_Paul het volgende:
>> Hi Jan,

> How on earth did you bump into that paper?
>
> -- Jan Hidders

That was at the start of my scientific career. I made the nD generalization

of PLA about 10 years ago. Remember?
http://digbib.ubka.uni-karlsruhe.de/volltexte/3882003

Norbert

Jan Hidders

unread,

Mar 7, 2013, 3:21:35 AM3/7/13

to

Op donderdag 7 maart 2013 09:08:51 UTC+1 schreef Norbert_Paul het volgende:

Right! Yes, I remember. Sorry for not being more supportive at the time. I was visiting UBC in Vancouver at the time and really busy working on XPath stuff that needed finishing before I would leave. Nice work. You are stil at KIT?

-- Jan Hidders

Erwin

unread,

Mar 15, 2013, 7:27:55 PM3/15/13

to

Op vrijdag 8 februari 2013 21:33:27 UTC+1 schreef hugoko...@gmail.com het volgende:

You have run into the issue of dependency preservation.

No, there is no better way.

Moving to 3NF/BCNF makes at least one of the FDs inexpressible, meaning it has to be reinstated as a constraint, and SQL has no support for enforcing that kind of constraint declaratively, meaning that enforcing it needs code.

That's how simple the "solution" is.

Jan Hidders

unread,

Mar 16, 2013, 7:08:12 AM3/16/13

to

Actually, as was already pointed out, the usual dependency-preserving
3NF synthesis algorithm gives you the following decomposition:

R1(a, b, c, d)
R2(a, b, e)
R3(b, c, e)

And that is in fact in BCNF. But as was also already observed, there is
possibly some redundancy here.

-- Jan Hidders

Erwin

unread,

Apr 30, 2013, 10:21:47 AM4/30/13

to

Op zaterdag 16 maart 2013 12:08:12 UTC+1 schreef Jan Hidders het volgende:

And an ugly set of constraints to control it.

In particular, that at all times R1 JOIN R2 === R1 JOIN R3.

No problem if you have CREATE ASSERTION (plus multiple assignment or its inferior nephew deferred constraint checking), but if enforcement is up to the programmer, it's not the simplest thing to achieve.

Erwin

unread,

Apr 30, 2013, 10:41:41 AM4/30/13

to

Op dinsdag 30 april 2013 16:21:47 UTC+2 schreef Erwin het volgende:

On second thought.

I've made the same mistake a couple more times before (of claiming that "you can't preserve dependencies" when in fact it should have been "you can't preserve dependencies, not without _introducing_ redundancy"). Thanks for reminding me.

It's odd to think of normalization theory, which was basically devised for _eliminating_ redundancy, as sometimes "requiring" to _introduce_ it ...

Jan Hidders

unread,

Apr 30, 2013, 5:39:58 PM4/30/13

to

True. But on the other hand, this type of restriction is often an
artifact of the initial table design and might not be a valid business
rule per se. And the getting rid of it is actually a feature and not a
bug.

-- Jan Hidders

Erwin

unread,

Apr 30, 2013, 6:17:44 PM4/30/13

to

Op dinsdag 30 april 2013 23:39:58 UTC+2 schreef Jan Hidders het volgende:

Also true :-)

It's often bothered me that normalization theory/procedure seems to quietly ignore the notion of "nullability or not" of any of the attributes in the "initial table design" ...

Iow, that normalization theory per se doesn't actually allow to determine when and when not the phenomenon is "an artifact of the iniital table design".

Yet iow, that normalization theory doesn't seem to bother whether or not the "initial table design" is in fact an accurate relational representation of the business problem ...

Jan Hidders

unread,

May 1, 2013, 5:20:47 AM5/1/13

to

On 2013-04-30 22:17:44 +0000, Erwin said:

> It's often bothered me that normalization theory/procedure seems to
> quietly ignore the notion of "nullability or not" of any of the
> attributes in the "initial table design" ...

Nullability doesn't really matter for normalization. From a redundancy
point of view null is just another value.

> Iow, that normalization theory per se doesn't actually allow to
> determine when and when not the phenomenon is "an artifact of the
> iniital table design".
>
> Yet iow, that normalization theory doesn't seem to bother whether or
> not the "initial table design" is in fact an accurate relational
> representation of the business problem ...

Keeping the schema equivalent in information content wasn't a goal
anyway, and it shouldn't be, otherwise we could not remove certain
update anomalies. And I would argue that in practice the different
components usually turn out to be independent facts anyway. The fact
that they have different underlying dependencies already sort of hints
at that. But if you have a realistic example that would show otherwise,
that would be interesting.

-- Jan Hidders

Erwin

unread,

May 1, 2013, 9:08:38 AM5/1/13

to

Op woensdag 1 mei 2013 11:20:47 UTC+2 schreef Jan Hidders het volgende:

> On 2013-04-30 22:17:44 +0000, Erwin said:
>
>
>
> > It's often bothered me that normalization theory/procedure seems to
>
> > quietly ignore the notion of "nullability or not" of any of the
>
> > attributes in the "initial table design" ...
>
>
>
> Nullability doesn't really matter for normalization. From a redundancy
>
> point of view null is just another value.

Well, fair enough I suppose, but not without noting that in this particular context, null is then supposed to be a value that is at least equal to itself :-)

>
>
>
> > Iow, that normalization theory per se doesn't actually allow to
>
> > determine when and when not the phenomenon is "an artifact of the
>
> > iniital table design".
>
> >
>
> > Yet iow, that normalization theory doesn't seem to bother whether or
>
> > not the "initial table design" is in fact an accurate relational
>
> > representation of the business problem ...
>
>
>
> Keeping the schema equivalent in information content wasn't a goal
>
> anyway,

I agree that it wasn't.

> and it shouldn't be,

But I think I disagree quite vehemently that it shouldn't ... in our age.

> otherwise we could not remove certain
>
> update anomalies.

"Update anomalies" are a laudable concept to consider, and eliminating them a laudable goal to try and achieve.

Normalization theory does quite a good job at that, as far as it goes, and that's good.

But I think (as of this point I'm nowhere near sure of what I'm saying so pls take with considerable slack and tons of salt) the concept of "update anomalies" are only a proper subset of the "difficulties for the database user to update the database correctly". That's because :

(a) all FDs correspond to constraints in the ultimate logical design,
(b) not all constraints correspond to FDs
(c) and the "difficulties to update the database correctly" derive from _all_ the constraints, not just from those that correspond to FDs.

Therefore, "minimizing the number of potential update anomalies" cannot and does not necessarily imply "minimizing the number of potential constraint violations a user can get when updating the database".

And the latter is exactly what I want, and what I think is needed, in database design. Because achieving _that_ implies the minimal number of considerations to make about constraints for the programmer writing the update statements. (and I think it is indeed _that_ which should be "minimized".)

To exemplify : if I were to run the OP's example on SIRA_PRISE, which has both multiple assignment and arbitrary complexity database constraints, I would _never_ opt for the three-table design that comes out of your dependency preserving normalization (and I would frown quite heavily at someone who would) !!!

I might _even_ leave the 5-column thing simply as is, define two projection views {b,c,e} and {c,d,e} on it and declare the appropriate keys {b,c} {c,d} on those projection views !!!!!!!

> And I would argue that in practice the different
>
> components usually turn out to be independent facts anyway. The fact
>
> that they have different underlying dependencies already sort of hints
>
> at that. But if you have a realistic example that would show otherwise,
>
> that would be interesting.

Not sure if I'm thinking of the same kind of thing as you, but I'm thinking of CUSTOMERs that have an AGE, where AGE determines [some kind of] COMMERCIAL _CLASS.

AGE and COMMERCIAL_CLASS don't become facts that are "independent" from there being a CUSTOMER to begin with.

Erwin

unread,

May 1, 2013, 9:12:37 AM5/1/13

to

Op woensdag 1 mei 2013 15:08:38 UTC+2 schreef Erwin het volgende:

> Not sure if I'm thinking of the same kind of thing as you, but I'm thinking of CUSTOMERs that have an AGE, where AGE determines [some kind of] COMMERCIAL _CLASS.
>
>
>
> AGE and COMMERCIAL_CLASS don't become facts that are "independent" from there being a CUSTOMER to begin with.

Bad example. Disregard pls.

Jan Hidders

unread,

May 1, 2013, 11:38:17 AM5/1/13

to

There we probably differ. It might be that each class was introduced
because of certain clients that we had in the past, but that does not
logically exclude the possibility that at some point in the future we
have for a brief period no clients of that class. In that case we would
not want to lose the link between age and that class. So a separate
age-class table would be a good design here.

-- Jan Hidders

Erwin

unread,

May 1, 2013, 12:08:17 PM5/1/13

to

Op woensdag 1 mei 2013 17:38:17 UTC+2 schreef Jan Hidders het volgende:

:-)

You responded only to the piece of which I wrote myself :

"Bad example. pls disregard."

Does the "_there_ we probably differ" imply then that you agree with the rest ?

James K. Lowden

unread,

May 1, 2013, 4:17:59 PM5/1/13

to

On Wed, 1 May 2013 11:20:47 +0200
Jan Hidders <hid...@gmail.com> wrote:

> On 2013-04-30 22:17:44 +0000, Erwin said:
>
> > It's often bothered me that normalization theory/procedure seems to
> > quietly ignore the notion of "nullability or not" of any of the
> > attributes in the "initial table design" ...
>
> Nullability doesn't really matter for normalization. From a
> redundancy point of view null is just another value.

I guess I don't know what a "redundancy point of view is", but

1. Null is not a value. Null is a mark noting that the value is
missing. I don't see how a non-value becomes a value depending on
one's point of view.

2. Null is not "just another" value. A nullable column by definition
has an optional relationship to the key. Any nullable column can be
placed in a separate table with an FK relationship of 0:1 or 0:N
cardinality.

AFAIK functional dependency analysis does not entertain nulls or
address missing information. The database designer should pay attention
to them, though, because they indicate the potential of more than one
kind of relation in a single table.

--jkl

Jan Hidders

unread,

May 1, 2013, 4:54:25 PM5/1/13

to

On 2013-05-01 20:17:59 +0000, James K. Lowden said:

> On Wed, 1 May 2013 11:20:47 +0200
> Jan Hidders <hid...@gmail.com> wrote:
>
>> On 2013-04-30 22:17:44 +0000, Erwin said:
>>
>>> It's often bothered me that normalization theory/procedure seems to
>>> quietly ignore the notion of "nullability or not" of any of the
>>> attributes in the "initial table design" ...
>>
>> Nullability doesn't really matter for normalization. From a
>> redundancy point of view null is just another value.
>
> I guess I don't know what a "redundancy point of view is", but
>
> 1. Null is not a value. Null is a mark noting that the value is
> missing. I don't see how a non-value becomes a value depending on
> one's point of view.

Strictly speaking it doesn't. But normalization theory does not care
about whether you have values or markers in your fields. Definitions
for redundancy, update anomalies, etc. stay the same.

> 2. Null is not "just another" value. A nullable column by definition
> has an optional relationship to the key. Any nullable column can be
> placed in a separate table with an FK relationship of 0:1 or 0:N
> cardinality.

Note that theoretically speaking that is also true for non-nullable
columns: you can als there create an information-equivalent schema that
way, even including the 0 lower bound for the FK relationship.

> AFAIK functional dependency analysis does not entertain nulls or
> address missing information. The database designer should pay attention
> to them, though, because they indicate the potential of more than one
> kind of relation in a single table.

True. That's actually a pretty good description of what normalization
is about: detecting whether more than one kind of relation is combined
into a relation and whether that is problematic or not.

-- Jan Hidders

Jan Hidders

unread,

May 1, 2013, 5:07:54 PM5/1/13

to

To some extent. I certainly agree with the observation that when
reasoning about update anomalies and redudancy we should take all
database constraints into account, not just functional dependencies.
But I still don't follow you where you state that in OP's original
example this is necessarily an issue. It could be, theoretically, but I
doubt it happens often in practice.

-- Jan Hidders

com...@hotmail.com

unread,

May 1, 2013, 9:35:15 PM5/1/13

to

On Tuesday, April 30, 2013 3:17:44 PM UTC-7, Erwin wrote:
> Op dinsdag 30 april 2013 23:39:58 UTC+2 schreef Jan Hidders het volgende:
>
> > On 2013-04-30 14:21:47 +0000, Erwin said:

> > >> R1(a, b, c, d)
> > >> R2(a, b, e)
> > >> R3(b, c, e)

> > > In particular, that at all times R1 JOIN R2 === R1 JOIN R3.

> > But on the other hand, this type of restriction is often an
> > artifact of the initial table design

No, it is not an artifact. The fact that it was a single-table schema is irrelevant to normalization. The normalized design will have such a constraint exactly when the original does. The original schema, involving a single table, is capable of representing any predicate on its attributes including the one with the constraint and the one without.

Eg: Given predicate R(a,b,c,d,e) the normalized relations have predicates
R1: EXISTS e R(a,b,c,d,e)
R2: EXISTS c,d R(a,b,c,d,e)
R3: EXISTS a,d R(a,b,c,d,e)
where R1(a,b,c,d) AND R2(a,b,e) AND R3(b,c,e) iff R(a,b,c,d,e). But people erroneously think the Ri can hold any old values. They come up with different predicates that instead of being of the form "...a...b...c...d...some e..." are of the form "...a...b...c...d...". When these mistaken predicates are conjoined they give a predicate that R(a,b,c,d,e) implies but might not be logically equivalent. It will be non-equivalent if the mistaken meanings differ from the correct meanings. Ie if it is possible for the "...a...b...c...d..." version to hold when there isn't some appropriate e value.

> > And the getting rid of it is actually a feature and not a
> > bug.

I agree that people come to understand that they had the wrong design. But the feature-not-bug situation is that it is noticed that the mistaken predicates and their conjunction are actually the correct design. Likely it is never noticed that the mistaken predicates are in fact mistaken, and the discrepancy between their conjunction and the original predicate is attributed, as by you and Erwin, to starting normalization from a single table. Normalization didn't have anything to do with the fact that the user chose an original predicate that was wrong for the design and that they misinterpreted the normalization-produced predicates as ones that were right for the design.

> It's often bothered me that normalization theory/procedure seems to quietly ignore the notion of "nullability or not" of any of the attributes in the "initial table design" ...
>
> Iow, that normalization theory per se doesn't actually allow to determine when and when not the phenomenon is "an artifact of the iniital table design".

There is no "phenomenon". Initially modeling by a universal relation does not itself cause "artifacts". Such a constraint either holds or doesn't in the original design and either correctly models the problem or doesn't, and normalization is independent of it. We are left with normalization just not happening to do something that we'd like done that we have no reason to expect it to do.

philip

Erwin

unread,

May 2, 2013, 6:40:28 AM5/2/13

to

Op donderdag 2 mei 2013 03:35:15 UTC+2 schreef com...@hotmail.com het volgende:

You've characterized my botherings with normalization stuff very well.

However, I'd argue that there is indeed "a phenomenon", and it's exactly the phenomenon you described as the "...a...b...c...d...some e..." vs. "...a...b...c...d..." confusion.

Jan observes that in practice most of the time, making this "mistake" leads people to "discovering" the correct predicates for the business problem at hand, and seems to be on the side that therefore it must be a good thing, I personally am rather bothered with the fact that a theory would be useful because misunderstanding it or only partially understanding it, in practice mostly leads to correct designs being derived from less correct ones ...

(I guess this must also be the reason why Fabian Pascal insists that if you do the conceptual modeling properly from the get-go, you won't even need normalization at all.)

Jan Hidders

unread,

May 2, 2013, 9:23:07 AM5/2/13

to

Well put, and I think we are not that far apart in these matters. I'm
certainly not in favor of applying the normalization algorithms blindly
and without understanding what it is that they do and do not achieve.
Understanding that the usual normalization algorithms do not
necessarily result in an equivalent schema is part of that. But most
textbooks I know are correct and clear on this and will state that the
goal is for example a lossless-join and dependency-preserving
decomposition.

A question I often ask to see if people really understand normalization
is the following:

The reason that is usually stated for not normalizing all the way up to
BCNF and stopping at 3NF is that you cannot always reach BCNF while
remaining dependency preserving. However, suppose we follow the
following procedure:

1. Apply the 3NF normalization procedure we discussed earlier

2. Include with each component only the FDs that generated that component

This gives you a lossless-join dependency-preserving decomposition. And
each component clearly is in BCNF for its included FDs. But wait, is
that not exactly the thing that was stated as impossible? So what gives
here?

> (I guess this must also be the reason why Fabian Pascal insists that if
> you do the conceptual modeling properly from the get-go, you won't even
> need normalization at all.)

In practice that is often true, but on the other hand it is
theoreticallly possible that even in a correct conceptual model there
are some normalization issues left. And apart from that I would argue
that normalization theory is important to understand what happens if
you start denormalizing in your implementation. And in practice it
happens that you have to start working on an existing system where you
need the theory to determine if and why the exisiting implementation is
problematic, and if it is what you can do about it.

-- Jan Hidders

com...@hotmail.com

unread,

May 4, 2013, 7:28:01 PM5/4/13

to

On Thursday, May 2, 2013 3:40:28 AM UTC-7, Erwin wrote:

> > > It's often bothered me that normalization theory/procedure seems to quietly ignore the notion of "nullability or not" of any of the attributes in the "initial table design" ...
> > > Iow, that normalization theory per se doesn't actually allow to determine when and when not the phenomenon is "an artifact of the iniital table design".

> However, I'd argue that there is indeed "a phenomenon", and it's exactly the phenomenon you described as the "...a...b...c...d...some e..." vs. "...a...b...c...d..." confusion.

You have changed the meaning of "phenomenon" from some alleged constraint imposed by an original single-relation schema to certain difficulties with certain design processes or their presentation. (Presumably you were concerned with the latter but hypothesized the former as the cause.)

> Jan observes that in practice most of the time, making this "mistake" leads people to "discovering" the correct predicates for the business problem at hand, and seems to be on the side that therefore it must be a good thing, I personally am rather bothered with the fact that a theory would be useful because misunderstanding it or only partially understanding it, in practice mostly leads to correct designs being derived from less correct ones ...

I agree with you that most (all?) presentations of db design and of normalization and the role of the latter in the former are poor. But you seem to confuse the latter with the former.

Eg normalization addresses replacing some relations with others but doesn't generally address information equivalence and constraint preservation, which is different from non-loss decomposition or FD preservation.

Eg when you normalize a relation variable based on FDs you have have to know its FDs so you have to know its predicate. (If you didn't have have a predicate in mind, why would you even have a relation variable?) So normalization cannot be expected to generate proper predicates. I agree a design process, incorporating normalization, should.

Eg if your only predicate/variable on attributes A1,..,An is p(A1,...,An) then the only predicates you can express on a subset of Ai,... involve "exists Ai,... p(A1,...,An)". For any tuple to be in its value there has to be corresponding supertuple(s) in p. So if you find that you want to express predicates when there *doesn't* have to be all n values (or corresponding things) then this predicate is inadequate. If you don't bother to think about what predicates you are interested in ie queries you want to pose and only bump into it when normalization presents you with predicates on attribute subsets (what did you expect it to present you with?) then that is not a fault of normalization.

philip

com...@hotmail.com

unread,

May 4, 2013, 7:32:18 PM5/4/13

to

On Thursday, May 2, 2013 6:23:07 AM UTC-7, Jan Hidders wrote:
> But most
> textbooks I know are correct and clear on this and will state that the
> goal is for example a lossless-join and dependency-preserving
> decomposition.

Can you name some? Because I don't think books are clear about normalization within design, which involves generated predicates and constraint preservation. Normalization per se assumes the generated components are projections of the original, ie are suitably constrained, without addressing the management of those constraints (eg introducing FKs for lost FDs).

> 1. Apply the 3NF normalization procedure we discussed earlier
> 2. Include with each component only the FDs that generated that component

It's not clear to me what you are describing or how it relates to arguing against stopping at 3nf.

An information-equivalent FD-anomoly-free bcnf and even JD-anomoly-free 5nf schema is always possible with suitable inter-component constraints. So decomposition to 3nf as a goal is just wrong/confused.

philip

com...@hotmail.com

unread,

May 5, 2013, 3:18:22 AM5/5/13

to

On Saturday, May 4, 2013 4:28:01 PM UTC-7, com...@hotmail.com wrote:
> Eg if your only predicate/variable on attributes A1,..,An is p(A1,...,An) then the only predicates you can express on a subset of Ai,... involve "exists Ai,... p(A1,...,An)".

Oops, that's "exists ...all but Ai,... p(A1,...,An)".

philip

Jan Hidders

unread,

May 5, 2013, 4:10:49 AM5/5/13

to

On 2013-05-04 23:32:18 +0000, com...@hotmail.com said:

> On Thursday, May 2, 2013 6:23:07 AM UTC-7, Jan Hidders wrote:
>> But most> textbooks I know are correct and clear on this and will state
>> that the> goal is for example a lossless-join and
>> dependency-preserving> decomposition.
>
> Can you name some? Because I don't think books are clear about
> normalization within design, which involves generated predicates and
> constraint preservation. Normalization per se assumes the generated
> components are projections of the original, ie are suitably
> constrained, without addressing the management of those constraints (eg
> introducing FKs for lost FDs).
>
>> 1. Apply the 3NF normalization procedure we discussed earlier
>> 2. Include with each component only the FDs that generated that component
>
> It's not clear to me what you are describing or how it relates to
> arguing against stopping at 3nf.

I've described to you a normalization algorithm that seems to produce a
schema in BCNF that is also lossless-join and dependency preserving,
but also has no inter-component dependencies. That's not supposed to be
possible according to the usual normalization explanation. So what went
wrong here?

-- Jan Hidders

Jan Hidders

unread,

May 5, 2013, 5:35:46 AM5/5/13

to

Agreed, although I'd add that FD preservation at least partially covers
constraint preservation.

But the main point of contention is whether in practice this is really
such a big issue. I'd certainly agree that in some particular cases it
can be important that your normalizing schema transformation results in
an information equivalent schema, and I also agree that schema
designers who are doing normalization should be aware of the fact that
this does not always happen with the usual normalization algorithms.

> Eg when you normalize a relation variable based on FDs you have have to
> know its FDs so you have to know its predicate. (If you didn't have
> have a predicate in mind, why would you even have a relation variable?)
> So normalization cannot be expected to generate proper predicates. I
> agree a design process, incorporating normalization, should.

This is a somewhat odd line of reasoning. It is not true that the
schema designer can only think in terms of predicates that are
associated with the relvars of the original database schema. So
normalization *can* be expected to generate new predicates that are not
expressible in terms of the old ones. And in practice it often does. Of
course you are free to redefine the process of normalization so that
this is not true, but then I'm afraid you would be becoming too
theoretical in the bad sense of the word.

The key question still is if being information equivalent (rather then
just being lossless) should be the main goal when redesigning your
schema / normalizing. You seem to think it is, but in my experience
this is almost never the case. So I'm curious to know on what it is
that you are basing that claim.

-- Jan Hidders

Erwin

unread,

May 5, 2013, 3:32:35 PM5/5/13

to

Op zondag 5 mei 2013 10:10:49 UTC+2 schreef Jan Hidders het volgende:

That's simply not true is it ?

You are referring to the three-table decomposition, right ?

That one does have the inter-component rule that R1 JOIN R2 === R1 JOIN R3 (as I pointed out), no ?

"What went wrong" imo is that your decomposition did not match the original "spirit" of the operation. In which relation schemas get split in _exactly_ two, with the dependent attributes of the "small" FD (bc->e) moved _out of_ the schema. your three-way split is either not a split in exactly two, or it is two subsequent splits-in-two where the first does _not_ move any attribute "out of the schema" (precisely because it is still needed for the second split).

Jan Hidders

unread,

May 5, 2013, 6:52:44 PM5/5/13

to

On 2013-05-05 19:32:35 +0000, Erwin said:

> Op zondag 5 mei 2013 10:10:49 UTC+2 schreef Jan Hidders het volgende:
>> On 2013-05-04 23:32:18 +0000, ... said:
>>
>>
>>
>>> On Thursday, May 2, 2013 6:23:07 AM UTC-7, Jan Hidders wrote:
>>
>>>> But most> textbooks I know are correct and clear on this and will
>>>> state>> >> that the> goal is for example a lossless-join and>> >>
>>>> dependency-preserving> decomposition.
>>
>>>
>>
>>> Can you name some? Because I don't think books are clear about>> >
>>> normalization within design, which involves generated predicates and>>
>>> > constraint preservation. Normalization per se assumes the generated>>
>>> > components are projections of the original, ie are suitably>> >
>>> constrained, without addressing the management of those constraints
>>> (eg>> > introducing FKs for lost FDs).
>>
>>>
>>
>>>> 1. Apply the 3NF normalization procedure we discussed earlier
>>
>>>> 2. Include with each component only the FDs that generated that component
>>
>>>
>>
>>> It's not clear to me what you are describing or how it relates to>> >
>>> arguing against stopping at 3nf.
>>
>>
>>
>> I've described to you a normalization algorithm that seems to produce
>> a>> schema in BCNF that is also lossless-join and dependency
>> preserving,>> but also has no inter-component dependencies. That's not
>> supposed to be>> possible according to the usual normalization
>> explanation. So what went>> wrong here?
>

> That's simply not true is it ?
>
> You are referring to the three-table decomposition, right ?

No. I'm referring to the algorithm that is described by steps 1. and 2.
above. It's an adapted variant of the usual algorithm that produced the
three-table decomposition. That older algorithm does not always produce
a schema in in BCNF. Mine seems to do so.

> That one does have the inter-component rule that R1 JOIN R2 === R1 JOIN
> R3 (as I pointed out), no ?

Only if you want to be information-equivalent but conventional
normalization does not require that. It only aims at obtaining a
lossless decomposition. What I am now pointing out is that if that is
our goal, it seems actually possible to always normalize to BCNF
without introducing inter-component dependencies. That cannot be right
now, can it? So what then is exactly meant with the conventional claim
that this is not possible?

> "What went wrong" imo is that your decomposition did not match the
> original "spirit" of the operation. In which relation schemas get
> split in _exactly_ two, with the dependent attributes of the "small" FD
> (bc->e) moved _out of_ the schema. your three-way split is either not
> a split in exactly two, or it is two subsequent splits-in-two where the
> first does _not_ move any attribute "out of the schema" (precisely
> because it is still needed for the second split).

Note that I was applying one of the usual textbook 3NF normalization
algorithms. Normalizing stepwise with pair-wise splits is not the only
option and will in fact sometimes not allow you to find a perfectly
valid 3NF decomposiition.

-- Jan Hidders

Erwin

unread,

May 6, 2013, 8:26:10 AM5/6/13

to

Op maandag 6 mei 2013 00:52:44 UTC+2 schreef Jan Hidders het volgende:

> On 2013-05-05 19:32:35 +0000, Erwin said:
>
>
>
>
>
>
> > That one does have the inter-component rule that R1 JOIN R2 === R1 JOIN
>
> > R3 (as I pointed out), no ?
>
>
>
> Only if you want to be information-equivalent but conventional
>
> normalization does not require that. It only aims at obtaining a
>
> lossless decomposition.

Well there's the whole contradiction isn't it.

How can deliberate admission of "information differences" coexist with "aims of being lossless" ?

If "information differences" are deliberately allowed, this can mean either or both of two things : something may be added, something may be lost.

I contend that the "lossless" in conventional normalization can be upheld only for such a perverse and narrow meaning of the term that the claim (of having "lossless" decompositions) becomes essentially meaningless/futile/irrelevant/...

Lossless then seems to mean purely (exaggerating a little bit) that "all the original attributes are still in the design, somewhere, somehow". Relation schemas as mere sets of attributes, and normalization as manipulation/set-algebraic computation on sets of sets of attributes, and the only restriction being that the union of the final sets of attributes must be equal to the original set of attributes (/ union of sets of attributes).

Jan Hidders

unread,

May 6, 2013, 11:18:38 AM5/6/13

to

On 2013-05-06 12:26:10 +0000, Erwin said:

> Op maandag 6 mei 2013 00:52:44 UTC+2 schreef Jan Hidders het volgende:
>> On 2013-05-05 19:32:35 +0000, Erwin said:
>>
>>> That one does have the inter-component rule that R1 JOIN R2 === R1
>>> JOIN>> > R3 (as I pointed out), no ?
>>
>> Only if you want to be information-equivalent but conventional>>
>> normalization does not require that. It only aims at obtaining a>>
>> lossless decomposition.
>
> Well there's the whole contradiction isn't it.
>
> How can deliberate admission of "information differences" coexist with
> "aims of being lossless" ?
>
> If "information differences" are deliberately allowed, this can mean
> either or both of two things : something may be added, something may be
> lost.

Well, "lossless" means nothing is lost, but, yes, something is allowed
to be added, i.e., the new schema might be able to represent
information that the old one could not.

> I contend that the "lossless" in conventional normalization can be
> upheld only for such a perverse and narrow meaning of the term that the
> claim (of having "lossless" decompositions) becomes essentially
> meaningless/futile/irrelevant/...

That's an .. er .. interesting claim. :-)

> Lossless then seems to mean purely (exaggerating a little bit) that
> "all the original attributes are still in the design, somewhere,
> somehow". Relation schemas as mere sets of attributes, and
> normalization as manipulation/set-algebraic computation on sets of sets
> of attributes, and the only restriction being that the union of the
> final sets of attributes must be equal to the original set of
> attributes (/ union of sets of attributes).

Nope.That's a necessary condition, but certainly not a sufficient condition.

-- Jan Hidders

nvitac...@gmail.com

unread,

May 6, 2013, 11:19:36 AM5/6/13

to

On Monday, May 6, 2013 12:52:44 AM UTC+2, Jan Hidders wrote:

> >>>> 1. Apply the 3NF normalization procedure we discussed earlier
> >>>> 2. Include with each component only the FDs that generated that component
>
> >>> It's not clear to me what you are describing or how it relates to>> >
> >>> arguing against stopping at 3nf.
>
> >> I've described to you a normalization algorithm that seems to produce
> >> a>> schema in BCNF that is also lossless-join and dependency
> >> preserving,>> but also has no inter-component dependencies. That's not
> >> supposed to be>> possible according to the usual normalization
> >> explanation. So what went>> wrong here?
>
> > That's simply not true is it ?
> > You are referring to the three-table decomposition, right ?
>
> No. I'm referring to the algorithm that is described by steps 1. and 2.
> above. It's an adapted variant of the usual algorithm that produced the
> three-table decomposition. That older algorithm does not always produce
> a schema in in BCNF. Mine seems to do so.

Hi, this is my first post in this group. I hope you won't mind if I dive into this very interesting discussion.

One exercise that I usually give to my students seems to fit exactly the above. First, I ask them to prove that the schema R(A,B,C,D,E) with FDs F = {AB -> C, C -> D, D -> B, E ->D} is not in 3NF (which they usually do easily). Then I ask them to decompose it in 3NF and they typically come out with:

R1(A,B,C), {AB -> C}
R2(C,D), {C -> D}
R3(D,B), {D -> B}
R4(D,E), {E -> D}
R5(A,E), {}

Finally I ask them: does this decomposition preserve FDs? (“yes, by construction”). Is C -> B implied by F? (“yes, by transitivity”). Is C -> B preserved or not in the decomposition? (panic) :)

Nicola

Erwin

unread,

May 6, 2013, 11:23:41 AM5/6/13

to

Op maandag 6 mei 2013 00:52:44 UTC+2 schreef Jan Hidders het volgende:

Previous reply written in too much of a hurry.

I'll repeat the running {abcde} example with { ... bc->e cd->e }.

The way I understand it, conventional normalization procedure would take, say, bc->e, introduce a relvar {bce} in which that FD was to hold, _and then reduce the original schema_ by the RHS of that FD, {e}.

"Your" normalization procedure, would take the same FD, introduce the same relvar {bce}, but it would then do a different kind of "reduction" on the original schema. Instead of "removing the RHS of the FD just 'processed'", without any consideration for remaining FDs in the original schema, you are now first going to consider the remaining FDs in the original schema, and remove _only those attributes_ from the RHS of the 'processed' FD that are no longer mentioned in the remaining FDs at all.

In the running example, you are _not_ going to remove {e} because e is still mentioned in cd->e, hence you are going to leave the original {abcde} schema unaltered. Only after you've 'processed' cd->e as well, will {e} be "removed" from the original schema.

I presume that such "degenerate decompositions", in which an extra relvar is introduced and the original schema left unaltered, were never considered. If that sort of answers your actual question (meaning I've probably grasped the essence of what your actual question is), I'm tempted to suggest that for a definitive answer, the question would have to be posed to the Chris Dates and/or Ron Fagins of this world. They're the ones who "were there when it happened".

Jan Hidders

unread,

May 6, 2013, 11:44:46 AM5/6/13

to

On 2013-05-06 15:19:36 +0000, nvitac...@gmail.com said:

> On Monday, May 6, 2013 12:52:44 AM UTC+2, Jan Hidders wrote:
>
>>>>>> 1. Apply the 3NF normalization procedure we discussed earlier> >>>> 2.
>>>>>> Include with each component only the FDs that generated that component
>>
>>>>> It's not clear to me what you are describing or how it relates to>> >
>>>>> arguing against stopping at 3nf.
>>
>>>> I've described to you a normalization algorithm that seems to produce
>>>> a>> schema in BCNF that is also lossless-join and dependency> >>
>>>> preserving,>> but also has no inter-component dependencies. That's not>
>>>> >> supposed to be>> possible according to the usual normalization> >>
>>>> explanation. So what went>> wrong here?
>>
>>> That's simply not true is it ?> > You are referring to the three-table
>>> decomposition, right ?
>>
>> No. I'm referring to the algorithm that is described by steps 1. and 2.
>> above. It's an adapted variant of the usual algorithm that produced
>> the> three-table decomposition. That older algorithm does not always
>> produce> a schema in in BCNF. Mine seems to do so.
>
> Hi, this is my first post in this group. I hope you won't mind if I
> dive into this very interesting discussion.

Please do. The water appears to be very nice today. :-)

> One exercise that I usually give to my students seems to fit exactly
> the above. First, I ask them to prove that the schema R(A,B,C,D,E) with
> FDs F = {AB -> C, C -> D, D -> B, E ->D} is not in 3NF (which they
> usually do easily). Then I ask them to decompose it in 3NF and they
> typically come out with:
>
> R1(A,B,C), {AB -> C}
> R2(C,D), {C -> D}
> R3(D,B), {D -> B}
> R4(D,E), {E -> D}
> R5(A,E), {}
>
> Finally I ask them: does this decomposition preserve FDs? (“yes, by
> construction”). Is C -> B implied by F? (“yes, by transitivity”). Is C
> -> B preserved or not in the decomposition? (panic) :)

Precisely! So you have identified my "error". :-) The FD C-> B should
apparently be added to T1, and so it is not in BCNF, only in 3NF. But
why not allow it to be omitted? It is reintroduced anyway if we join
everything back together again, and apparently we are fine if the
resulting schema is a bit more liberal then the old schema anyway. So
what is the underlying rationale / philosophy here that dicttates that
we include it in R1 and thereby prevents us from reaching BCNF?

-- Jan Hidders

Jan Hidders

unread,

May 6, 2013, 12:07:39 PM5/6/13

to

> Previous reply written in too much of a hurry.
>
> I'll repeat the running {abcde} example with { ... bc->e cd->e }.
>
> The way I understand it, conventional normalization procedure would
> take, say, bc->e, introduce a relvar {bce} in which that FD was to
> hold, _and then reduce the original schema_ by the RHS of that FD, {e}.

That's one of the conventional normalization procedures. I was using a
slightly more advanced and powerful one. See slide 23 on:

http://www.cs.sfu.ca/CourseCentral/354/jpei/slides/UsingBCNF-3NF.pdf

As you can see it doesn't work by splitting off dependencies one by one.

-- Jan Hidders

nvitac...@gmail.com

unread,

May 6, 2013, 1:33:51 PM5/6/13

to

Very well said!

> what is the underlying rationale / philosophy here that dicttates that
> we include it in R1 and thereby prevents us from reaching BCNF?

I think the idea is that C -> B “applies” to R1(A,B,C) because it is defined on a subset of R1's attributes. We can preserve it “for free” by projecting it to R1, instead of indirectly by joining with other schemes, and one may argue that is usually what is wanted. But, honestly, my arguments have never convinced me too much :) I'm curious about your answer.

Nicola

Message has been deleted

Erwin

unread,

May 6, 2013, 3:27:19 PM5/6/13

to

Op maandag 6 mei 2013 17:19:36 UTC+2 schreef nvitac...@gmail.com het volgende:

If C->B applies, then how is R1 in 3NF ?

Looking at the relation schema for R1, sole candidate key is AB. The applicable FD's LHS is completely outside the candidate key.

Erwin

unread,

May 6, 2013, 3:48:10 PM5/6/13

to

Op maandag 6 mei 2013 21:27:19 UTC+2 schreef Erwin het volgende:

>
>
>
> If C->B applies, then how is R1 in 3NF ?
>
>
>
> Looking at the relation schema for R1, sole candidate key is AB. The applicable FD's LHS is completely outside the candidate key.

Of course, if C->B effectively applies, then AB is no longer really the sole candidate key, since AC is a candidate key too, but even so the FD at hand "starts from" a proper subset of a candidate key, meaning it's not even 2NF (if I recall my NF's correctly. I have to use a "donkey bridge" to remember. The key, the whole key, and nothing but the key. "whole key"=2, "nothing but the key"=3.)

nvitac...@gmail.com

unread,

May 6, 2013, 4:05:57 PM5/6/13

to

Il giorno lunedì 6 maggio 2013 20:27:36 UTC+2, Erwin ha scritto:
> Op maandag 6 mei 2013 17:19:36 UTC+2 schreef nvitac...@gmail.com het volgende:
>

> > Finally I ask them: does this decomposition preserve FDs? (“yes, by construction”). Is C -> B implied by F? (“yes, by transitivity”). Is C -> B preserved or not in the decomposition? (panic) :)
>
> > Nicola
>

> You've added a bit of another dimension.
>
> c->b is not an explicitly stated FD (not member of the canonical cover ???) but a "derivable" one.

Yes, the set of FDs I have proposed is a canonical cover.

> Does the issue of dependency preservation also apply to "derived"(/-able) FDs ? Should it ? Was it consciously so intended ?

I'd say yes to all questions. The fact that a dependency is not explicitly stated doesn't make it less real. Besides, it's a good thing that we don't have to make all dependencies explicit: a set of FDs may well have an exponential number of logical consequences (although these may be pathological cases)!

> If so, then what about the "derived" FDs that express what the candidate keys are in a relation schema (say, ae->abcde) ? What is _their_ preservation status in any decomposition ?

Good point. Dependency preservation is about all dependencies, explicit or not; key dependencies are no different. As it has been accurately said by Jan Hidders, if you take the join of all the schemata of a dependency-preserving decomposition, the resulting relation satisfies all the original dependencies (explicit or not). In the particular example, if you take r1 join r2 join … join r5, AE -> ABCDE is always satisfied. in this sense, the key dependency is preserved.

Formally, a decomposition is “dependency-preserving” if the union U of all the FDs associated to the decomposed schemata is _equivalent_ (not equal!) to the set F of FDs in the original schema. That is, every dependency in F _or derivable from F_ must be _derivable_ from U, and vice versa.

The point with C -> B it is that it is not strictly needed as far as dependency preservation, as just defined, is concerned. So, Jan Hidders was asking, why would anyone insist in adding it to R1(A,B,C), making the schema be in 3NF instead of BCNF, when, after all, it is satisfied by the join of R1,…, R5 anyway?

>Of course, if C->B effectively applies, then AB is no longer really the sole candidate key, since AC is a candidate key too, but even so the FD at hand "starts from" a proper subset of a candidate key, meaning it's not even 2NF

No, R(A,B,C) with {AB -> C, C -> B} is the classical example of a schema that is in 3NF (because all attributes are prime, that is, belong to some key) but not in BNCF.

Erwin

unread,

May 6, 2013, 4:56:48 PM5/6/13

to

I would say :

(a) because it's a rule that applies
(b) not enforcing a rule that applies means that "nonconforming data" can enter R1
(c) and that means that "nonconforming data" can be shown to the user anytime R1 is inspected without being joined with the four others.

That's what I call the "narrowed down vision of normalization theory". Decompositions are proposed, and properties of those decompositions are calculated and proven, but always under some apparent assumption that the decomposition will always be undone by always joining everything back together again.

That simply isn't how it (reality) works, is it ?

All those "dependency preservation" and "lossless" properties _depend_ on "all decompositions always being undone back again", and that's exactly what underpins that bold claim of mine (where I said "narrow and perverse meaning of the term" and "essentially useless").

"R(A,B,C) with {AB -> C, C -> B} is the classical example of ..."

Shit. So when Boyce & Codd "fixed" the definition of 3NF for coping with multiple keys, what they actually should have been addressing was 2NF as well :-)

com...@hotmail.com

unread,

May 6, 2013, 5:06:18 PM5/6/13

to

On Monday, May 6, 2013 12:48:10 PM UTC-7, Erwin wrote:
> The key, the whole key, and nothing but the key. "whole key"=2, "nothing but the key"=3.

one or more of key + whole key + nothing but the key = 2
key + whole key = 3
key + whole key + nothing but the key = bcnf

as pointed out by Nicola, R1 is in 3nf but not bcnf.

philip

com...@hotmail.com

unread,

May 6, 2013, 5:44:04 PM5/6/13

to

On Monday, May 6, 2013 8:18:38 AM UTC-7, Jan Hidders wrote:
> On 2013-05-06 12:26:10 +0000, Erwin said:

> > How can deliberate admission of "information differences" coexist with
> > "aims of being lossless" ?

It is called a "lossless" decomposition because if the components are projections of the original then they join to the original. Nb this assumes certain predicates for the components in terms of the original. (It has nothing per se to do with other predicates.) Nb this does not deal with appropriate constraints on the components or the original.

Compare to additionally being "dependency-preserving" where if given dependencies hold in the components then they hold in the original. So enforcing them in the components enforces them in the join. Although moving from 3nf to bcnf is not dependency-preserving, one can introduce suitable inter-component constraints. Nb other constraints are not dealt with by normalization.

> Well, "lossless" means nothing is lost, but, yes, something is allowed
> to be added, i.e., the new schema might be able to represent
> information that the old one could not.

Normalization does not "allow" this. What happens is people normalize, then they notice they really wanted a schema allowing more states (ie predicates that the normalization-generated predicates imply), which they didn't need to normalize to find, then they change to new the predicates for the components, giving a new predicate for their join, BUT in such a way that just happens to leave the new components properly normalizing the new join. They confuse this process with normalization and do not understand normalization's role in it.

> > I contend that the "lossless" in conventional normalization can be
> > upheld only for such a perverse and narrow meaning of the term that the
> > claim (of having "lossless" decompositions) becomes essentially
> > meaningless/futile/irrelevant/...

> That's an .. er .. interesting claim. :-)

Erwin is concerned with information-equivalent schemas and appropriate constraints. His error is in confusing "lossless" with this.

Other than that he is correct. Received wisdom for use of normalization in the design process is confused. If you don't use the existential component predicates per my earlier message, ie the components are not projections of the original, then you are not using the design you normalized.

The reason why received-wisdom design process of normalizing one design but using a different one works is that if you decide to use new component predicates with the same dependencies and a new original that is their join (thus with a new predicate that is their conjunction) then this new decomposition of a new original into new components will also be in the same normal form. (Appropriate non-dependency constraints will have to be determined.)

Other predicates/variables on other subsets of the attributes remain unexplored. Normalization is neither sufficient nor necessary for that. You could just determine what predicates you want on what subsets of attributes and then normalize. But then the desired superset predicates tend to be obvious conjunctions of the subset ones. Thus Fabian's normalization-light designs per Erwin's earlier message.

philip

Erwin

unread,

May 6, 2013, 6:14:40 PM5/6/13

to

On Monday, May 6, 2013 11:44:04 PM UTC+2, com...@hotmail.com wrote:
> Erwin is concerned with information-equivalent schemas and appropriate constraints. His error is in confusing "lossless" with this.

Yes. Guilty. (Figured it out by now.)

Jan Hidders

unread,

May 6, 2013, 6:51:01 PM5/6/13

to

I'm afraid you are losing me here. You seem to be unilaterally
redefining the notion of normalization and then accusing other people
of misunderstanding what they are doing just because they are using a
different definition. That's not going to lead to a productive
discussion.

-- Jan Hidders

nvitac...@gmail.com

unread,

May 7, 2013, 5:54:47 AM5/7/13

to

Normalization theory based on functional dependencies says nothing about the inter-relational constraints of the decomposition—and it can't, because normal forms are defined wrt a single relational schema. Consider R(A,B,C) with F = {A -> B, B -> C}, which can be decomposed into (R1(A,B), {A -> B}) and (R2(B,C), {B -> C}). Which of the following would you consider valid instances?

(1) Information-equivalent, that is, R1[B] = R2[B]:

r1: {(a,b)}
r2: {(b,c)}

(2) No inter-relational constraint:

r1: {(a,b)}
r2: {(c,d)}

(3) Referential integrity, that is, R1[B] ⊆ R2[B]:

r1: {(a1,b1)}
r2: {(b1,c1), (b2, c2)}

(4) Inclusion dependency (IND), that is, R2[B] ⊆ R1[B]:

r1: {(a1,b1),(a2,b1),(a3,b2)}
r2: {(b1,c1)}

(if you don't see the symbol ⊆, it is the “(not necessarily proper) subset of” symbol).

One may argue that normalization (implicitly?) deals only with case (1), and that (2)–(4) are the result of realizing, after normalizing, that the original design was, in fact, incorrect (i.e., wrong predicates). Whether you consider this “healing” feature a goal of normalization or an entirely different process may be argued at length (my point of view is the former).

The way I see it, the decomposition is not the end of the story. At that point, a much more complex process starts: you must discover how to keep the whole database schema together (in the most basic case, this means you have to discover which INDs hold). This is where normalization theory falls short, in my opinion. For example:

S(A,B,C), {A -> BC}
R(A,B), {}

are each in BCNF. Now, suppose that you determine that R[A,B] ⊆ S[A,B]. Does that introduce any redundancy? Does it change the keys of R? Is it still a good database schema (in some sense to be defined)?

The issue the OP had falls in this category:

R(a,b,c,d), {abc -> d}
S(a,b,e), {ab -> e}
T(b,c,e), {bc -> e}

Has this redundancy or anomalies? Can this be simplified further? Under which hypotheses?

As far as I know, there is no accepted notion of normalization for the database schema as a whole. The two definitions that I know of which take into account FDs and INDs (*) are not satisfactory in my opinion (we may discuss that in another thread if you want).

Yes, even reasoning with just functional and inclusion dependencies together is very hard (read: many problems become undecidable). But I'd like to see, if I may say so, a more holistic approach to the process.

Nicola

(*) Tok Wang Ling and Cheng Hian Goh. Logical database design with inclusion dependencies. In Proceedings of the Eighth International Conference on Data Engineering, pages 642–649, 1992.

(*) Mark Levene and Millist W. Vincent. Justification for inclusion dependency normal form. IEEE Transactions On Knowledge and Data Engineering, 12(2):281–291, March 2000.

nvitac...@gmail.com

unread,

May 7, 2013, 9:22:55 AM5/7/13

to

> Normalization theory based on functional dependencies says nothing about the inter-relational constraints of the decomposition

“It says nothing about the inter-relational constraints of a database schema” would be more precise. Here is another example:

R(A,B,C), {B -> A}
S(D,E), {D -> E}

R is not in 2NF, the only key being BC. Normalization theory would then suggest that it must be decomposed. Now, suppose that, in addition to the above, you impose R[B,C]⊆S[D,E] (a referential integrity constraint from (B,C) to (D,E). Would you still decompose R? Or would you do something else?

Erwin

unread,

May 7, 2013, 9:38:18 AM5/7/13

to

Op dinsdag 7 mei 2013 11:54:47 UTC+2 schreef nvitac...@gmail.com het volgende:

"Normalization theory based on functional dependencies says nothing about the inter-relational constraints of the decomposition—and it can't, because normal forms are defined wrt a single relational schema."

Well, that is not entirely true, is it ?

I'm willing to bet that many a course in database normalization will effectively mention something along the lines of "and you have to introduce a foreign key constraint between the decomposed tables". Date does it too in "Introduction to Database Systems", 8ed. !

Why is that ????????

And why is it that only one of the possible two inclusion dependencies are typically ever mentioned/considered ??????? Why is it that "spurious tuples" must indeed be prevented, through the IND, from appearing in R(abcd), but that "spurious tuples" must _not_ be prevented from appearing in S(abe) or in T(bce) ???????

Because "usually", it is exactly what is desired anyway ???????? I'm inclined to agree, pragmatically, but I cannot accept that a "true scientist" would content himself with that answer. A "true scientist" would set out to seek, formally and precisely, when _exactly_ "usually" is indeed the case, and when exactly it is not.

I would love to discuss "database normal forms". As explicitly opposed to "relation normal forms".

Jan Hidders

unread,

May 7, 2013, 9:50:35 AM5/7/13

to

I'd argue that classical normalization theory is agnostic here: it
simply doesn't talk about any other dependencies that might be implied,
and so in some sense considers all of them valid. Case in point:
consider the claim that one can always reach 3NF while staying
dependency preserving. Under assumption (1) this claim becomes,
although strictly speaking true, virtually meaningless from a practical
point of view. Yes, all the original FDs are implied by the local ones,
but along the way you may have had to introduce inter-component
dependencies, some of them not even expressible as INDs, to stay
information equivalent (*). That might be a (too) heavy a price to pay
and one that may dwarf the cost of having any inter-component FDs.
However, under (2) this reachability claim actually does make sense. On
the other hand, most of the original theoretical papers worked under
the universal relation assumption, whcih seems to be more consistent
with (1). So, as I said, I think classical normalization theory is
mostly agnostic here, assuming that such a statement actually makes
sense and is not anthropomorphising too much.

(*) Note the relationship with the observation that you cannot reach
3NF by just splitting of violating dependencies.

> Whether you consider this “healing” feature a goal of normalization or
> an entirely different process may be argued at length (my point of view
> is the former).

Indeed. But I'd consider that a rather vacuous discussion as the answer
is mostly a matter of defintion. Life is short, I have exams to grade,
articles to write, faculty meetings to attend, and so preferrably try
to stay away from that type of discussion. :-) Let us please
concentrate on what constitutes a good design theory and what types of
algorithms and theorems are needed to help with that. We will give that
baby a name once it is born, not before. :-)

> The way I see it, the decomposition is not the end of the story. At
> that point, a much more complex process starts: you must discover how
> to keep the whole database schema together (in the most basic case,
> this means you have to discover which INDs hold). This is where
> normalization theory falls short, in my opinion. For example:
>
> S(A,B,C), {A -> BC}
> R(A,B), {}
>
> are each in BCNF. Now, suppose that you determine that R[A,B] ⊆ S[A,B].
> Does that introduce any redundancy? Does it change the keys of R? Is it
> still a good database schema (in some sense to be defined)?

Indeed. Those are the right questions. My tentative answers:
- Yes.
- Depends on what you mean by "the keys of R". Presuming we are doing
logical design and so are happy with a set of constraints that implies
all the ones we know, I'd say we don't need ot add this key constraint.
From a practical point of view it migh of course be useful.
- Maybe, as I don't see one at the moment that is better in every respect.

> The issue the OP had falls in this category:
>
> R(a,b,c,d), {abc -> d}
> S(a,b,e), {ab -> e}
> T(b,c,e), {bc -> e}
>
> Has this redundancy or anomalies? Can this be simplified further? Under
> which hypotheses?

It all depends on which INDs, TGDs, etc. you think should hold in addition.

> As far as I know, there is no accepted notion of normalization for the
> database schema as a whole. The two definitions that I know of which
> take into account FDs and INDs (*) are not satisfactory in my opinion
> (we may discuss that in another thread if you want).

Interesting. The Levene et al. paper I knew, but not the Ling et al.
paper. I'd be very interested in such a discussion.

> Yes, even reasoning with just functional and inclusion dependencies
> together is very hard (read: many problems become undecidable). But I'd
> like to see, if I may say so, a more holistic approach to the process.

Me too. :-)

-- Jan Hidders

Jan Hidders

unread,

May 7, 2013, 10:07:49 AM5/7/13

to

R is then in BCNF, since B has become a key. So no splitting needed.

-- Jan Hidders

nvitac...@gmail.com

unread,

May 7, 2013, 10:10:43 AM5/7/13

to

The reasons why you “have to” introduce foreign key constraints after decomposing are the same as why you “have to” introduce functional dependencies in the first place: it's because the semantics of the data requires it.

Normalization does not provide an answer to the questions you pose: you “see”, by looking carefully at the meaning of the data, that some inter-relational constraints must hold much in the same way as you “see” that some FDs must hold (unless, of course, you insist that the decomposition must be “information-equivalent” to the original schema, in which case, you may probably infer such constraints automatically).

> I would love to discuss "database normal forms". As explicitly opposed to "relation normal forms".

Give me some time to refresh my mind on the topic ;)

Nicola

Erwin

unread,

May 7, 2013, 10:20:58 AM5/7/13

to

Op dinsdag 7 mei 2013 11:54:47 UTC+2 schreef nvitac...@gmail.com het volgende:

> The way I see it, the decomposition is not the end of the story. At that point, a much more complex process starts: you must discover how to keep the whole database schema together (in the most basic case, this means you have to discover which INDs hold). This is where normalization theory falls short, in my opinion. For example:
>
>
>
> S(A,B,C), {A -> BC}
>
> R(A,B), {}
>
>
>
> are each in BCNF. Now, suppose that you determine that R[A,B] ⊆ S[A,B]. Does that introduce any redundancy? Does it change the keys of R? Is it still a good database schema (in some sense to be defined)?

Am I allowed to interject my own tentative answers here ?

Does it [introducing the IND] introduce redundancy ?

Yes. Because the IND now causes the A->B FD to "carry over" to R.

Does it changes the keys of R ?

Yes. Because that IND has been "carried over".

Is it still a good database design ?

No, unless in the case when I very deliberately want to retain the redundancy in the database, as a sort of "four-eyes" check on the actual assertions that are being made in the database update process[es].

The constraint (the IND, that is) that is needed to "control this redundancy" is a proposition that says "for any A value in R, the corresponding B value found in R must be as can also be found in the corresponding tuple from S, which must exist."

An S updater might have asserted "A value is 3 and corresponding B value is 7".

An R updater (or, depending on how you look at it, the entire overall database update process) is now "double-checked" by forcing him to also specify the B value 7 whenever he inserts the A value 3. Forcing him to sort of "confirm his awareness of the nature of that IND".

In all other cases, I would remove the redundancy.

nvitac...@gmail.com

unread,

May 7, 2013, 10:50:03 AM5/7/13

to

Sure. You need to look at the whole database schema to infer that. And, in the general case, there is no algorithm that is guaranteed to give you an answer! Data modeling is hard, isn't it? :)

The schema above could then be replaced by:

R(A,B), {B -> A}
S(D,E), {D -> E}

plus R[B]⊆S[D] (a foreign key from B to D). Arguably, this schema is better because redundant information (the C attribute) has been removed.

Personally, I believe that this extension of normalization theory, including at least special cases of INDs, hasn't be fully understood yet (or, well, _I_ haven't fully understood it), among other reasons, I guess, because computational complexity results are discouraging. The papers I've mentioned give significant contributions, but I don't think that they are the final answer. And we are still considering only the most basic types of constraints… How lucky NoSQL users are, who don't need data modeling (e.g., http://www.couchbase.com/why-nosql/nosql-database)! :)

nvitac...@gmail.com

unread,

May 7, 2013, 11:12:25 AM5/7/13

to

On Tuesday, May 7, 2013 4:20:58 PM UTC+2, Erwin wrote:

> Am I allowed to interject my own tentative answers here ?
>
> Does it [introducing the IND] introduce redundancy ?
>
> Yes. Because the IND now causes the A->B FD to "carry over" to R.
>
> Does it changes the keys of R ?
>
> Yes. Because that IND has been "carried over".
>
> Is it still a good database design ?
>
> No, unless in the case when I very deliberately want to retain the redundancy in the database, as a sort of "four-eyes" check on the actual assertions that are being made in the database update process[es].
>
> The constraint (the IND, that is) that is needed to "control this redundancy" is a proposition that says "for any A value in R, the corresponding B value found in R must be as can also be found in the corresponding tuple from S, which must exist."
>
> An S updater might have asserted "A value is 3 and corresponding B value is 7".
>
> An R updater (or, depending on how you look at it, the entire overall database update process) is now "double-checked" by forcing him to also specify the B value 7 whenever he inserts the A value 3. Forcing him to sort of "confirm his awareness of the nature of that IND".
>
> In all other cases, I would remove the redundancy.

I essentially agree with all you say, but please note that the point about the key is a delicate one: it means that now to determine the keys of a relation schema you need to look at the whole database schema!

Erwin

unread,

May 7, 2013, 11:27:25 AM5/7/13

to

Op dinsdag 7 mei 2013 17:12:25 UTC+2 schreef nvitac...@gmail.com het volgende:

The 1992 paper you pointed to contains/points out the formal foundation for what I informally called "carrying over". Page 2, lemma 1, "pullback rule".

And it implies that you have to look at all the (and only those) relation schemas that are the "target" of an IND, to see if any FDs can be inferred/carried over to the "source" of the IND.

Remaining question might be whether there are any other rules that could also cause "possible inference/carrying over" ... I can certainly see things getting complicated there.

nvitac...@gmail.com

unread,

May 7, 2013, 11:34:44 AM5/7/13

to

On Tuesday, May 7, 2013 5:27:25 PM UTC+2, Erwin wrote:

> The 1992 paper you pointed to contains/points out the formal foundation for what I informally called "carrying over". Page 2, lemma 1, "pullback rule".
>
> And it implies that you have to look at all the (and only those) relation schemas that are the "target" of an IND, to see if any FDs can be inferred/carried over to the "source" of the IND.
>
> Remaining question might be whether there are any other rules that could also cause "possible inference/carrying over" ... I can certainly see things getting complicated there.

Definitely. And provably so.

The “implication problem” for a set of dependencies (of some form) is the problem of proving that a given dependency is a logical consequence of others (e.g., A->C is a logical consequence of A->B and B->C). For FDs, the implication problem can be solved in linear time. For INDs, the implication problem is PSPACE-complete. For FDs and INDs combined, the implication problem is undecidable. Etc…

Erwin

unread,

May 7, 2013, 1:07:24 PM5/7/13

to

Op dinsdag 7 mei 2013 17:34:44 UTC+2 schreef nvitac...@gmail.com het volgende:

That seems like a curious result.

Given that all FDs correspond to keys, all keys are database constraints, and all database constraints can be expressed as INDs.

Jan Hidders

unread,

May 7, 2013, 2:34:54 PM5/7/13

to

On 2013-05-07 14:50:03 +0000, nvitac...@gmail.com said:

> On Tuesday, May 7, 2013 4:07:49 PM UTC+2, Jan Hidders wrote:
>
>> On 2013-05-07 13:22:55 +0000, nvitac...@gmail.com said:
>>>> Normalization theory based on functional dependencies says nothing> >>
>>>> about the inter-relational constraints of the decomposition
>>
>>> “It says nothing about the inter-relational constraints of a database>
>>> > schema” would be more precise. Here is another example:
>>
>>> R(A,B,C), {B -> A}> > S(D,E), {D -> E}
>>
>>> R is not in 2NF, the only key being BC. Normalization theory would then
>>> suggest that it must be decomposed. Now, suppose that, in addition to>
>>> > the above, you impose R[B,C]⊆S[D,E] (a referential integrity
>>> constraint> > from (B,C) to (D,E). Would you still decompose R? Or
>>> would you do> > something else?
>>
>> R is then in BCNF, since B has become a key. So no splitting needed.
>
> Sure. You need to look at the whole database schema to infer that. And,
> in the general case, there is no algorithm that is guaranteed to give
> you an answer! Data modeling is hard, isn't it? :)
>
> The schema above could then be replaced by:
>
> R(A,B), {B -> A}
> S(D,E), {D -> E}
>
> plus R[B]⊆S[D] (a foreign key from B to D). Arguably, this schema is
> better because redundant information (the C attribute) has been removed.

Indeed. Why did I miss that? I bow my head in shame. :-)

But, to work. So we would like to define normal forms that characterize
avoidable or on the other hand the complete lack of redundancy. I can
think of several candidates for characterization:

(A) Can be losslessly transformed into another schema with no
redundancy, where in the new schema where all implied local
dependencies (FDs and INDs) are maintained.
(B) Can be losslessly transformed into another schema with no
redundancy, where in the new schema all implied local FDs and global
INDs are maintained.
(C) Can be losslessly transformed into another schema with no
redundancy, where in the new schema all implied EGDs and TGDs (local
and non-local) are maintained.

Of course the notion of transformation needs to be suitably defined.
Probably transformations consist only of projections, renamings and
joins. Note that the last one covers the case where the schema needs to
be information-equivalent, presuming we start with only EGDs and TGDs.

I smell a good paper. :-)

> How lucky NoSQL users are, who don't need data modeling (e.g.,
> http://www.couchbase.com/why-nosql/nosql-database)! :)

I sometimes point them to Libkin's work on XML normal forms and tell
them "that's now your normalzation theory right there, and things
actually get a lot more compicated". That work applies to any data
model that is hierarchical / semistructured like XML or JSON.

-- Jan Hidders

Erwin

unread,

May 8, 2013, 6:30:45 AM5/8/13

to

Op dinsdag 7 mei 2013 16:50:03 UTC+2 schreef nvitac...@gmail.com het volgende:

In this example, it seemed to me like the "pullback rule" was sufficient to find the answer. Applying the pullback rule to a finite set of explicitly stated INDs, their sources and their targets, certainly seems like a feasible piece of algorithm.

So what is "the general case" ?

nvitac...@gmail.com

unread,

May 8, 2013, 5:58:01 PM5/8/13

to

On Wednesday, May 8, 2013 12:30:45 PM UTC+2, Erwin wrote:
> Op dinsdag 7 mei 2013 16:50:03 UTC+2 schreef nvitac...@gmail.com het volgende:

> "And, in the general case, there is no algorithm that is guaranteed to give you an answer!"
>
> In this example, it seemed to me like the "pullback rule" was sufficient to find the answer. Applying the pullback rule to a finite set of explicitly stated INDs, their sources and their targets, certainly seems like a feasible piece of algorithm.
>
> So what is "the general case" ?

The general case is when you allow a database schema to have FDs together with an arbitrary set of INDs, in particular INDs that may form cycles. The pullback rule is not enough to infer all the consequences. In fact, there is no complete recursively enumerable set of axioms for finite implication (that is, something like Armstrong axioms for FDs). For all the details, see:

J. C. Mitchell. The implication problem for functional and inclusion dependencies. Information and Control, 56:154–173, 1983.
M. A. Casanova, R. Fagin, and C. H. Papadimitriou. Inclusion dependencies and their interaction with functional dependencies. Journal of Computer and System Sciences, 28:29–59, 1984.
A. K. Chandra and M. Y. Vardi. The implication problem for functional and inclusion dependencies is undecidable. SIAM J. Comput., 14(3):671–677, Aug. 1985.

I believe that these negative results have basically discouraged research on normal forms including INDs, although, for many subclasses, problems become decidable. Ling and Goh's approach, if my memory assists me, was to define a normal form based only on those dependencies derivable with the pullback rule (so, using an incomplete inference system); Mannila and Räihä's approach, later studied by Levene and Millist, was to impose that FDs and INDs do not interact (so that the implication problems for FDs and INDs could be treated independently).

Nicola

Jan Hidders

unread,

May 9, 2013, 7:09:26 AM5/9/13

to

I believe that part of the problem is also that it is not so clear
anymore what we would like to / can achieve. What does it mean to be
non-redundant? The old goal was to remove redundancy, something we can
define exactly and could always achieve, albeit sometimes at the cost
of inter-relational dependencies, but with INDs there's always a little
bit of redundancy that we cannot really remove and probably don't want
to. So can we precisely formulate the new goals? Let me give it a try.

First let me define what I mean with redundancy: a database schema
allows redundancy if there is an instance of that schema where there is
a a particar field of a tuple that is derivable from the remainder of
the instance. In other words, if we replace the value of the field with
a variable x the we can derive the value of x. I will call a schema
that does not allow redundancy, redundancy-free.

I will also assume that normalization is being done by taking
projections. I will get back to that later because there are reasons to
also include equi-joins. I will make a distinction between liberal
normalization (the dependencies in the normalized schema imply all
dependencies in the original schema, but not necessarily vice versa)
and conservative normalization (also vice versa).

Clearly the goal is not being completely redundancy-free, although it
is of course better if we can achieve that. The type of redundancy we
don't seem to mind is where INDs are copying values between different
columns. It's also clear that you cannot get rid of that type of
redundancy by taking projections unless you are projecting columns away
completely, i.e., it does not return anywhere in the normalized schema.
So we are maximally normalized if:
1. There are no columns that can be omitted, and
2. The schema is redundancy-free if we consider all derivable
dependencies except the INDs.

You can generalize this for EGDs and TGDs by replacing "INDs" with
"TGDs that copy values between different columns."

So can we always reach this normal form? For liberal normalization this
seems straightforward, but what about conservative normalization? What
about the case where we start with a single relation with only FDs?

How am I doing sofar? :-)

Final note on the type of operations we can use when normalizing; I
suspect we should also allow equi-joins, or actualy any conjunctive
query. Consider schema R(A,B) S(B,C) with { R[B]=>S[B}, S[B}=>R[B] }.
According to the above this is fully normalized, but clearly the
equivalent R'(A,B,C) with { } is a schema with less redundancy and the
DMBS has less to check. But it's not clear to me at the moment how this
changes the definition of the normal form. Probably adding an extra
rule something like: "you cannot join relations without introducing
redundancy as defined under rule 2."

-- Jan Hidders

Erwin

unread,

May 9, 2013, 2:38:09 PM5/9/13

to

Op donderdag 9 mei 2013 13:09:26 UTC+2 schreef Jan Hidders het volgende:

If there is a JD *(R,S) then conventional normalization theory says "maintaining the split" is the better option. Interesting. So let's elaborate the two cases.

(A) one schema SR{A,B,C} with the constraint expressing the JD : SR === SR{A,B} JOIN SR{B,C}. That's two INDs

SR SUBSETOF ( SR{A,B} JOIN SR{B,C} )
( SR{A,B} JOIN SR{B,C} ) SUBSETOF SR

Eurhm, well, if I'm right that INDs can also include arbitrary RA expressions.

This is "hard" on all stakeholders (user, designer & DBMS) of this database :

The computations involved for the DBMS in checking the constraint are not exactly trivial.
Determining which access paths (indexes) need to be available for efficient checking to be possible might be not exactly trivial either.
If we wanted to write a DELETE statement that always passes the constraint (not very sure of how relevant this is), then deleting a tuple a1,b1,c1 will involve DELETE (SR{A,B} WHERE B=b1) JOIN (SR{B,C} WHERE B=b1) FROM SR;
Something similar, but utterly more horrendous, applies to INSERTs of a tuple a1,b1,c1.

(B) two schema R{A,B} and S{B,C} with the constraint expressing the "no join orphans" in either schema : R{B} === S{B}. Once again two INDs

R{B} SUBSETOF S{B}
S{B} SUBSETOF R{B}

The computations needed for checking these seem simpler.
Determining needed indexes for efficiency of checking is fairly trivial.
Writing INSERTs and DELETEs that are "guaranteed to pass" for a given a1,b1,c1 (once again not sure though how relevant this is) are trivial for INSERT (just INSERT {a1,b1} INTO R, INSERT {b1,c1} INTO S; ) but substantially more intracate for DELETE.

All in all, it seems to me like if there is a JD between the attributes of the decomposed schemata, the conventional rule for 5NF should prevail despite the "redundancy" you considered.

Jan Hidders

unread,

May 9, 2013, 9:52:39 PM5/9/13

to

Thanks for pointing that out. You are of course completely right. For
some reason I had forgotten that there was a join-dependency in the
result of the join (duh), and those are indeed hard to check fo the
database. Good, that means we can stick with the assumption that
normalization is done by taking projections.

-- Jan Hidders

Jan Hidders

unread,

May 10, 2013, 4:29:36 AM5/10/13

to

… unless they are implied by keys. So the proper example should be:

R{A, B} S{B, C} with { R.B -> R.A, S.B -> S.C, R[B] => S[B], S[B] => R[B] }

which perhaps should be normalized to:

T'{A, B, C} with { T.B -> T.A, T.B -> S.C }

So I need to recant my recantation. :-) Projection should perhaps be
considered as normalizing operations.. But it complicates the theory,
so it's probaly better to first try to understand normalization without
it.

With respect to another comment you made: "Eurhm, well, if I'm right

that INDs can also include arbitrary RA expressions."

Yes, they can. Strictly speaking normal INDs don't allow that, but we
need to move here to a more general and powerfull class of dependencies
known as TGDs (tuple-generating dependencies) and EGDs (equality
generating dependencies). And TGDs can indeed be described as INDs
where on the left-hand side you can use in the place of the relation
name a conjunctive query, i.e., an RA expression containing relation
names, equality selections, cartesian products, projections and
renaming (but not the other RA operations). As you already saw, that
class then also generalize MVDs and JDs.

The EGDs generalize functional dependencies: you specify a conjunctive
query and state then that two columns are identical. For example, the
FD A->B in R means that in

SELECT[A=A'](R TIMES RENAME[A to A', B to B'](R))

it holds that B and B' always contain the same value.

-- Jan Hidders

com...@hotmail.com

unread,

May 10, 2013, 5:03:27 AM5/10/13

to

On Thursday, May 9, 2013 11:38:09 AM UTC-7, Erwin wrote:
> All in all, it seems to me like if there is a JD between the attributes of the decomposed schemata, the conventional rule for 5NF should prevail despite the "redundancy" you considered.

The INDs hold on S and R iff JD *{AB,BC} holds on their join. So an adequate SR constraint is SR = SR{A,B} JOIN SR{B,C}. Since a JD holds that is not implied by a key, some single-tuple changes will be invalid (ie this version is subject to update anomolies). Whereas the R+S version is in 5nf.

All the other specifics of your message are just consequences of non-5nf vs 5nf.

On Friday, May 10, 2013 1:29:36 AM UTC-7, Jan Hidders wrote:
> R{A, B} S{B, C} with { R.B -> R.A, S.B -> S.C, R[B] => S[B], S[B] => R[B] }
> which perhaps should be normalized to:
> T'{A, B, C} with { T.B -> T.A, T.B -> S.C }

But now the keys imply the JD (and the INDs) so you don't need to declare/mention it (or them). Presumably it should be normalized to this, we can always decompose to 6nf by separating non-key attributes but we don't.

philip

com...@hotmail.com

unread,

May 10, 2013, 5:28:49 AM5/10/13

to

On Friday, May 10, 2013 2:03:27 AM UTC-7, com...@hotmail.com wrote:
> On Friday, May 10, 2013 1:29:36 AM UTC-7, Jan Hidders wrote:
> > R{A, B} S{B, C} with { R.B -> R.A, S.B -> S.C, R[B] => S[B], S[B] => R[B] }
> > which perhaps should be normalized to:
> > T'{A, B, C} with { T.B -> T.A, T.B -> S.C }
>
> But now the keys imply the JD (and the INDs) so you don't need to declare/mention it (or them).

I was trying to express that any such T'(A,B,C) with FDs from its key and no explicit JD or INDs can decomposed to such projections so given the FDs and the JD or INDs we can obviously go the other way.

philip

Jan Hidders

unread,

May 10, 2013, 6:39:16 AM5/10/13

to

To answer my own question: no, you cannot. Consider the example that
started this thread:

R{A, B, C, D, E} with { AB -> E, BC -> E }

Suppose we split into R1{A, B, C, D} and R2{A, B, E}. Consider the
following instance:

R1:
a1 b1 c1 d1
a2 b1 c1 d2

R2:
a1 b1 e1
a2 b1 e1(*)

The marked e1 is derivable. The case for splitting off the other
dependency is the same.

The normal form does become reachable if we switch to semi-conservative
normalization where in the normalized schema we require in the
normalized schema not all dependencies derivable from those in the
original schema, but only the local ones. That seems to be the position
that conventional normalization is taking, or at least the Bernstein
synthesis algorithm. Note that under that regime (semi-conservative
normalization) you also cannot fully normalize the standard
3NF-but-not-BCNF example: R{A, B, C} with { AB -> C, C -> B }.

-- Jan Hidders

nvitac...@gmail.com

unread,

May 10, 2013, 7:24:35 AM5/10/13

to

> The old goal was to remove redundancy, something we can
> define exactly and could always achieve, albeit sometimes at the cost
> of inter-relational dependencies, but with INDs there's always a little
> bit of redundancy that we cannot really remove and probably don't want
> to.

Indeed.

> First let me define what I mean with redundancy: a database schema
> allows redundancy if there is an instance of that schema where there is
> a a particar field of a tuple that is derivable from the remainder of
> the instance. In other words, if we replace the value of the field with
> a variable x the we can derive the value of x. I will call a schema
>that does not allow redundancy, redundancy-free.

Your definition is the so-called “value redundancy”, which characterizes BCNF. It has already been proved that any database schema with non trivial INDs cannot be devoid of value redundancy (see Levene & Millist's paper, Lemma 4.2). In that same paper, they give a more appropriate definition of redundancy and they prove that IDNF (inclusion dependency normal form) is equivalent to avoiding that kind of redundancy (a database schema in IDNF also avoids insertion and modification anomalies, in a sense that can be precisely defined). So, I think that this direction has been already explored and, unless one disagrees with the definitions, IDNF captures the salient notions adequately.

For completeness, a database schema with a set of FDs F and a set of INDs I is in IDNF iff it is in BCNF with respect to F and the INDs represent acyclic referential integrity constraints (where the targets are keys, not super-keys).

Despite its nice properties, this normal form may be too restrictive in two respects: first, cyclic dependencies arise in practice (the authors, at the end of the paper, propose to relax IDNF by allowing INDs that express “pairwise consistency”, as in your example); second, are designs with INDs that are not key-based necessarily “bad”? One motivation for restricting INDs that way is that those conditions imply that FDs and IND do not interact (in particular, the keys of each schema can be determined based only on the FDs on that schema).

> Clearly the goal is not being completely redundancy-free, although it
> is of course better if we can achieve that.

Definitely.

> So we are maximally normalized if:
> 1. There are no columns that can be omitted, and
> 2. The schema is redundancy-free if we consider all derivable
> dependencies except the INDs.

It would be interested to verify how this compares with the above. I am not sure whether I understand exactly what you mean by 2.

> Final note on the type of operations we can use when normalizing; I
> suspect we should also allow equi-joins, or actualy any conjunctive
> query.

So, your starting point is an arbitrary set of schemas instead of one, to which you add FDs and inter-relational constraints as per your requisites, and then you apply some rule to split as well as merge the schemas. Sounds challenging :) Btw, is there any intrinsic difference in considering several schemas instead of a single universal schema, as far as this process is concerned?

Nicola

Jan Hidders

unread,

May 10, 2013, 11:34:34 AM5/10/13

to

You are right (I have looked in the past at that paper, but apparently
had forgotten about the details) but I don't agree that this is closed
subject. By no means. Note that I am in a sense trying to generalize
what they did for larger classes of dependencies. I have to do that
since I want to address the questions of achievabality of the normal
forms (under liberal/semi-conservative/conservative normalization, as I
called them). Under conservative normalization it can happen that
starting from a schema with FDs and INDs I end up with a schema with
more complex TGDs then just INDs. Therefore I need to be able to take
those into account. What is also missing in their work is a
characterization of when you can get to IDNF, and an algorithm that
when this is possible gets you there. I realized that strictly speaking
such an algorithm will not exist because of the undecidability of
reasoning over the dependencies in question, but it would still be
interesting to see if it exists while assuming an oracle that can do
the reasoning for you. It might be practical for certain restricted
sets of dependencies where inference is computable. Maybe this is the
case, for example for the dependencies implied by conservative
normalization when starting out with only FDs.

Note that the type of analyis I want to do is the one that tells you
for example that even if you start with just FDs then (generalized)
RFNF / BCNF is not always reachable, assuming you conservatively
normalize. That was actually a bit of a suprise to me. I always assumed
you could reach a redundancy-free normal form if you were willing to
pay the price of a non-local dependency, but that is actually not true.
So what about the cases where you can? Can I use the usual "split of
violating dependencies" strategy without backtracking? It's that type
of question that I'd like to investigate.

>> So we are maximally normalized if:
>> 1. There are no columns that can be omitted, and
>> 2. The schema is redundancy-free if we consider all derivable>
>> dependencies except the INDs.
>
> It would be interested to verify how this compares with the above. I am
> not sure whether I understand exactly what you mean by 2.

It generalizes RFNF. Note that "INDs" should be replaced with "TGDs and
EGDs that copy/equate values between different columns" where
"different" is defined as "is in a different relation or has a
different name, or both". (I forgot about EGDs in the previous posting,
but they should also be considered.) This happens in TGDs if in the
head a variable appears in a different column then it does in the tail,
and in an EGD if the two variables in the equation in the head appear
in different columns in the tail. Note this means that FDs, JDs and
MVDs stay.

So in stead of the original set of dependencies we consider a set that
implies the same set of dependencies except those that copy/equate
values between different columns. So INDs, for example, are ignored and
in that sense it is more liberal then IDNF. The reasoning here is that
the redundancy that is caused by these dependencies cannot be removed
anyway by taking projections, except if you can omit some of the
involved columns. So it makes sense to ignore that type of redundancy.

>> Final note on the type of operations we can use when normalizing; I>
>> suspect we should also allow equi-joins, or actualy any conjunctive>
>> query.
> So, your starting point is an arbitrary set of schemas instead of one,
> to which you add FDs and inter-relational constraints as per your
> requisites, and then you apply some rule to split as well as merge the
> schemas. Sounds challenging :)

That's because I'm first trying to establish what the problem is that
we *should* be solving. In practic you probably would like to join if
it means you can get rid of a few INDs and it doesn't introduce new
redundancy. However, I will no doubt start making simplifications once
I really start to investigate this, and assuming that we only do
projections would be very likely one of them. :-) But I think it needs
to be made clear that this is indeed a simplifying assumption.

> Btw, is there any intrinsic difference in considering several schemas
> instead of a single universal schema, as far as this process is
> concerned?

I don't know. The only way of finding out is to formulate both cases
and see if there is a difference. It is to me not a priori clear
whether if I focus within a database schema on a single relation, its
normalization will be influenced by non-local dependencies, especially
since I'm considering such large classes of them.

-- Jan Hidders

Jan Hidders

unread,

May 11, 2013, 6:20:14 AM5/11/13

to

Answer: yes, the choice of dependency matters. Example:

R{A, B, C, D} with { B -> C, C -> D }

Verify this is not in BCNF and so also not in gRFNF(*). Suppose I start
with splitting of B -> D (which of course also holds), then I get:

R1{B, D} R2{B, C} R3{A, B}

But this still allows redundancy because of B -> C (now translated to
the corresponding EGD in this schema), and so is not in gRFNF. Yet,
there is a better form that does not have this problem:

R1{B, C} R2{C, D} R3{A, B}

Conjecture: if you start with a single relation with a set of FDs then
you can reach gRFNF while conservatively normalizing iff the Bernstein
synthesis algorithm gets you there.

(*) By gRFNF I mean my generalzation of RFNF as presented earlier.

-- Jan Hidders

Jan Hidders

unread,

May 12, 2013, 1:16:59 PM5/12/13

to

On 2013-05-10 15:34:34 +0000, Jan Hidders said:

At the risk of turning this group into a monologue, let me explain this
a bit more.

The class of dependencies that Levene and Vincent are considering
consists of FDs and INDs. But that class is not closed under
decomposition when conservatively normalizing. Let me explain what I
mean by that. Assume we have a schema R(A, B, C) with { R:A->B, R:B-> C
}. If I split into R1(A, B) and R2(A, C) which dependencies would hold
there. If I just consider those in FD+IND I get: { R1:A->B, R2:A->C,
R1[A]=>R2[A}, R2[A]=>R1[A] }. (I'm using here => to indicate the IND.)
But to be really information equivalent the schema should have an
additional dependency that represents the old dependency R:B->C. So we
need a larger class of dependencies such that if the original schema is
normalized by splitting relations the result can always be turned into
an information equivalent schema by simply adding all dependencies in
that class that follow from the dependencies in the original schema. It
is in that sense that the class of dependencies needs to be closed
under decomposition.

Another requirement is that the class of dependencies should also be
closed under removal of redundant columns and relations. Note that this
is posible if we allow in our normalization procedure that the new
schema is defined as a set of relations where each is a projection of
one of the old relations. This is subtly different from the more usual
definition where the new schema is equal to the old except that some
relations are decomposed into projections such that these cover all
columns of the original relation. The latter definiition does not allow
the removal of redundant columns and relations.

An interesting (large) class that seems to satisfy these requirements
is the class containing EGDs and TGDs, but there might be also smaller
classes possible. For example, the class where we generalize FDs to
Q[X->Y] where Q is an algebra expression consisting of natural joins
and projections, and it has the meaning the for the result of Q the FD
X->Y holds. Likewise we generalize INDs to Q1=>Q1 with the semantics
that the result of Q1 is a subset of Q2. I will call these gFDs and
gINDs. Note that strictly speaking gINDs do not generalize INDs because
we can only have relationships between columns with the same name. (So
R[A,B]=>S[A,C] is not allowed, the result of the two queries must have
the seam header.) This class seems sufficient if you start of with just
one relation and only dependencies of this class. It might even be
possible that reasoning over these dependencies is decidable.

Anyway, the problem then is that the trick of L&V to separate the FDs
and INDs in the definition of the normal form does not work anymore
because the gINDs sometimes express dependencies that cause local
redundancy. In particular they allow us to express MVDs and JDs: the JD
*(AB, AC) for relation R can be expressed as R[A,B] * R[A,C] => R where
* represents the natural join. So which gINDs need to be considered
when checking value-redundancy, and which can we ignore? In this case
its simple: for checking value-dependency you only consider the gINDs
that copy value to a column in a relation only if it was already in
that column. E.g., (R[A,B] * S[B,C])[A,B] => R[A,B] should be
considered but (R[A,B] * S[B,C])[A,B] => S[A,B] should not. The
coneptual / philosophica justification of that is that redundancy that
is caused by such dependencies cannot be resolved by taking projections
anyway.

-- Jan Hidders

hugoko...@gmail.com

unread,

May 23, 2013, 8:43:46 AM5/23/13

to

Sorry for my long absence, everyone. When there were no new replies after a week, I had stopped checking. And now I come back and see that in the mean time, a long and very interesting discussion has taken place.

Thanks for your additional insights, Jan, Erwin, Philip, and Nicola! (I hope I missed nobody)

On Wednesday, May 1, 2013 11:20:47 AM UTC+2, Jan Hidders wrote:
>
> But if you have a realistic example that would show otherwise,
>
> that would be interesting.

I do indeed have a realistic (well, somewhat) example that illustrates the problem that started this topic. I have already presented it earlier in the discussion, but I'll repeat it. And I will reword it slightly to address the concern, expressed in some messages, that the issue can be an artifact of the initial table design.

As mentioned before - this is not an example I came across for real, I made it up to check for myself if the set of FDs I presented in the first message would even be possible. The example is a bit far-fetched, but I think realistic enough that it could be real.

A charity is organizing a big fundraiser event. Several rock bands and several opera singers will participate. Throughout the country, in different locations, there will be performances where one rock band and one opera singer will sing one song together. This event will last a whole month. On any given date in that month, there may be performances in different locations. At any given location, there may be multiple performances at different dates.
A band can play more than one song on a given date, but not with the same opera singer. They might have another performance with the same singer at another date, either at the same or another location, and either with the same or another song. If a band does play with multiple opera singers at the same date, it may be the same or a different song.
An opera singer can play more than one song on a given date, but not with the same band. They might have another performance with the same band at another date, either at the same or another location, and either with the same or another song. If an opera singe does play with multiple bands at the same date, it may be the same or a different song.

Translating the above in functional dependencies results in: (Band, Singer, Date) --> Location and (Band, Singer, Date) --> Song.

But there is one additional requirement. If a band plays multiple songs on the same date, they all have to be in the same location. And the same restriction goes for opera singers. Though technically possible to have a band or singer perform in New York and London on the same day, the organization lacks the travel budget for it. So now we have (Band, Date) --> Location and (Singer, Date) --> Location. One of the original FDs, (Band, Singer, Date) --> Location, is now not a full FD anymore, as it is implied by these two new FDs.

Note that the entire description makes no mention at all of tables. It's just the functional requirements, and the resulting FDs. So there is no initial table design that can cause artifacts.

As far as I can see, there are no other FDs implied by the scenario. I apply Bernsteins algorithm to the FDs, and get three relations:
R1: {Band, Singer, Date, Song} with (Band, Singer, Date --> Song)
R2: {Band, Date, Location} with (Band, Date --> Location)
R3: {Singer, Date, Location} with (Singer, Date --> Location)

The inclusion dependencies are totally obvious, but for completeness sake I'll mention them:
1: R1.(Band, Date) --> R2.(Band, Date)
2: R2.(Band, Date) --> R1.(Band, Date) (*)
3: R1.(Singer, Date) --> R3.(Singer, Date)
4: R3.(Singer, Date) --> R1.(Singer, Date) (*)
(*) These two are, unfortunately, not enforcable by most popular RDBMS's. But let's assume for now that we have one that doesn't suffer from this limitation. This restriction on most RDBMS's is sufficiently known that I expect the application code to enforce the dependency.

Now suppose that Placido Domingo and Fun are scheduled to play "Imagine" in London on July 15, 2013. Placido calls and says he can't make it to London, he has to be in New York for other obligations. The organization has three options: cancel this performance, reschedule for another date, or move it to New York.
Choice 1: Cancel. Remove the tuple from R3, and INDs 4 and 1 force you to remove (or do it automatically, if defined as cascading) the performance of Placido and Fun of Imagine, and the appearance of Fun in London. That is okay; this is what you would expect in this scenario if a singer has to cancel.
Choice 2: Reschedule. Change the date in R3 to July 17, and again INDs 4 and 1 force you to change (or do it automatically, if defined as cascading) the date in R1 and R2 as well. So Imagine will now be played by Fun and Placido on May 17, and Fun will be in London at that time. Again, just what we want.
Choice 3: And here's the problem. I can update the location for (Placido, July 15) in R3 to New York, and the RDBMS will accept the location. So unless someone wakes up at the right time, we will have Fun in London and Placido in New York for their joint performance of Imagine. Let's hope someone sets up the sattelite connection in time! ;)

I expect that, based on this realistic example, you'll reaffirm that there is no better way than this, and that I'll just need more code to ensure that singers and bands that are supposed to perform together will, actually, do so at the same location. But if this example does open the door to alternative solutions, I'm all ears!

Cheers,
Hugo

Jan Hidders

unread,

May 23, 2013, 12:43:42 PM5/23/13

to

On 2013-05-23 12:43:46 +0000, hugoko...@gmail.com said:

> Sorry for my long absence, everyone. When there were no new replies
> after a week, I had stopped checking. And now I come back and see that
> in the mean time, a long and very interesting discussion has taken
> place.
>
> Thanks for your additional insights, Jan, Erwin, Philip, and Nicola! (I
> hope I missed nobody)
>
>
> On Wednesday, May 1, 2013 11:20:47 AM UTC+2, Jan Hidders wrote:
>>
>> But if you have a realistic example that would show otherwise,>> that
>> would be interesting.
>
> I do indeed have a realistic (well, somewhat) example that illustrates
> the problem that started this topic. I have already presented it
> earlier in the discussion, but I'll repeat it. And I will reword it
> slightly to address the concern, expressed in some messages, that the
> issue can be an artifact of the initial table design.
>
> As mentioned before - this is not an example I came across for real, I
> made it up to check for myself if the set of FDs I presented in the
> first message would even be possible. The example is a bit far-fetched,
> but I think realistic enough that it could be real.

No problem. Not real, but realistic, is fine. And I agree that it is that.

It's to some extent nitpicking, but FDs only make sense in the context
of a single table, so in that sense you have assumed the existence of a
table. However, I still think it is a good example of what you would
like to demonstrate.

> As far as I can see, there are no other FDs implied by the scenario. I
> apply Bernsteins algorithm to the FDs, and get three relations:
> R1: {Band, Singer, Date, Song} with (Band, Singer, Date --> Song)
> R2: {Band, Date, Location} with (Band, Date --> Location)
> R3: {Singer, Date, Location} with (Singer, Date --> Location)
>
> The inclusion dependencies are totally obvious, but for completeness
> sake I'll mention them:
> 1: R1.(Band, Date) --> R2.(Band, Date)
> 2: R2.(Band, Date) --> R1.(Band, Date) (*)
> 3: R1.(Singer, Date) --> R3.(Singer, Date)
> 4: R3.(Singer, Date) --> R1.(Singer, Date) (*)
> (*) These two are, unfortunately, not enforcable by most popular
> RDBMS's. But let's assume for now that we have one that doesn't suffer
> from this limitation. This restriction on most RDBMS's is sufficiently
> known that I expect the application code to enforce the dependency.

These are indeed all inclusion dependencies, but not all dependencies
that should be maintained. If you are conservatively normalizing (i.e.
do not allow the liberalization of the meaning of the components, and
want your targe schema to be infomation equivalent) you should also add
the dependency that states that if a band and singer are performing a
song together on a certain date, then both band and singer should be
in the same location. That's representable as an EGD, but not FD or IND.

> Now suppose that Placido Domingo and Fun are scheduled to play
> "Imagine" in London on July 15, 2013. Placido calls and says he can't
> make it to London, he has to be in New York for other obligations. The
> organization has three options: cancel this performance, reschedule for
> another date, or move it to New York.
> Choice 1: Cancel. Remove the tuple from R3, and INDs 4 and 1 force you
> to remove (or do it automatically, if defined as cascading) the
> performance of Placido and Fun of Imagine, and the appearance of Fun in
> London. That is okay; this is what you would expect in this scenario if
> a singer has to cancel.
> Choice 2: Reschedule. Change the date in R3 to July 17, and again INDs
> 4 and 1 force you to change (or do it automatically, if defined as
> cascading) the date in R1 and R2 as well. So Imagine will now be played
> by Fun and Placido on May 17, and Fun will be in London at that time.
> Again, just what we want.
> Choice 3: And here's the problem. I can update the location for
> (Placido, July 15) in R3 to New York, and the RDBMS will accept the
> location. So unless someone wakes up at the right time, we will have
> Fun in London and Placido in New York for their joint performance of
> Imagine. Let's hope someone sets up the sattelite connection in time! ;)

Let's indeed. :-) But with the missing dependecy added this update
would not be possbiel.

> I expect that, based on this realistic example, you'll reaffirm that
> there is no better way than this, and that I'll just need more code to
> ensure that singers and bands that are supposed to perform together
> will, actually, do so at the same location. But if this example does
> open the door to alternative solutions, I'm all ears!

With the risc of stating the obvious: strictly speaking normalization
theory does not tell you what to do or what is best for you, it only
tells you what potential problems exist when en what you can (or
cannot) do about them. Just saying. ;-)

What I like about your example, and for me it's most important point,
as that indeed there are cases where you do want to normalize
conservatively. Being liberal here would mean for example that we
reinterpret R2 and R2 for example as "are present but not necessarily
performing", and R1 as "are scheduled to sing". But that would not even
conserve all information, unless you assume that if band and singer are
scheduled and both present they will indeed perfrom.

Secondly it illustrates that if you are normalizing conservatively it
is actually not true that you can always reach 3NF and be dependency
preserving in the sense that you don't need to introduce (rather
unusual) inter-relational dependencies. The latter notion of dependency
preserving is not how it is usually defined, but can under certain
practical circumstances be the more relevant notion.

So, well done, and thanks. :-)

-- Jan Hidders

hugoko...@gmail.com

unread,

May 29, 2013, 10:20:06 AM5/29/13

to

On Thursday, May 23, 2013 6:43:42 PM UTC+2, Jan Hidders wrote:

(snip)

> It's to some extent nitpicking, but FDs only make sense in the context
> of a single table, so in that sense you have assumed the existence of a
> table.

Okay, at the risk of goinig off-topic (and of showing my lack of formal education), but I have to ask. Is this a terminology issue? Suppose I find a FD: Student --> Mentor, and later find that this is a transitive FD because I also find Student --> Class and Class --> Mentor. For 3NF compliance, I would obviously split the table. This of course has no influence at all on the relationship between Student and Mentor. I have no problem accepting that this is now formally no longer called a functional dependency, but then what other term should I use instead?

(big snip)

> So, well done, and thanks. :-)

You're welcome!
And most of all, thank *you* for your excellent explanation! :-)

Cheers,
Hugo

Jan Hidders

unread,

May 30, 2013, 7:28:00 AM5/30/13

to

On 2013-05-29 14:20:06 +0000, hugoko...@gmail.com said:

> On Thursday, May 23, 2013 6:43:42 PM UTC+2, Jan Hidders wrote:
>
> (snip)
>> It's to some extent nitpicking, but FDs only make sense in the context>
>> of a single table, so in that sense you have assumed the existence of
>> a> table.
>
> Okay, at the risk of goinig off-topic (and of showing my lack of formal
> education), but I have to ask. Is this a terminology issue? Suppose I
> find a FD: Student --> Mentor, and later find that this is a transitive
> FD because I also find Student --> Class and Class --> Mentor. For 3NF
> compliance, I would obviously split the table. This of course has no
> influence at all on the relationship between Student and Mentor. I have
> no problem accepting that this is now formally no longer called a
> functional dependency, but then what other term should I use instead?

Let me first say that this is certainly not a big mistake to make.
There is a straightforward generalization of the concept of FD over a
single relation to one that holds over different relations: it might
mean that the FD holds over the relation that you get if you join all
the tables in the schema into a single relation by using natural joins.
In fact many textbooks assume this without saying so. But note that
this assumes we can/should indeed join all the tables with natural
joins, which in practice is not alwayes true because some column
renaming is sometimes required. This is however true if you built your
schema by starting from a single relation and then normalizing. So even
then there is somehow still the idea in the background that the whole
schema is somehow still a single relation. For normal discussions of
normalization this is something that can be easily ignored, but in this
case it was at the core of the discussion.

So what to call them then? In advanced literature on normalization you
will find the notions of EGDs (equality generating dependencies) and
TGDs (tuple generating dependencies). I'l try to give a very brief
introduction:

An FD can be considered as a special case of a first-order logic
statement. If for a relation R(A,B,C) we have the FD A->B we could also
state this as:

ForAll x, y, z : R(A : x, B : y) & R(A : x, B : z) -> y=z

Or, if we are moving to the unlabeled perspective of the relational
model with ordered columns:

ForAll x, y, z, u, v : R(x, y, u) & R(x, z, v) -> y=z

The class of EGDs is now defined by generalizing this a little. We
stick to the general form of the formula, so it should still be:

ForAll .. : .. & .. & .. -> .. = ..

but we allow any number of variables, and any number of conjuctions
over any number of different relations. So you can write for example:

ForAll x, y, z, u, v : R1(x, y) & R2(y, z) & R1(x, u) & R2(u, v) -> z=v

An that of course expresses what the FD A->C becomes if we split R(A,
B, C) into R1(A, B) and R2(B, C).

Although the class of EGDs is quite big and generalizes FDs, it does
not allow you to express MVDs and JDs, and also not INDs. Those are
generalized in a different class called TGDs, tuple generating
dependencies, which is defined in a very simillar way. Look at how we
would express the IND R1[B] => R2[B] in first order logic:

ForAll x, y, z : R1(x, y) -> Exists z : R2(y, z)

Or the MVD A ->> B in R(A, B, C):

ForAll x, y, z : R(x, y, z) & R(x, u, v) -> R(x, y, v) & R(x, u, z)

So the class of TGDs is defined by formulas of the form:

ForAll .. : .. & .. & .. -> Exists .. : .. & .. & ..

This class indeed can express all MVDs, JDs and INDs, plus the
dependencies they turn into if we start splitting tables.

I could tell more, but need to get back to work now. :-)

-- Jan Hidders

Erwin

unread,

May 30, 2013, 6:02:16 PM5/30/13

to

Op donderdag 30 mei 2013 13:28:00 UTC+2 schreef Jan Hidders het volgende:

Are EGD's in any way related to DKNF ?

Are the individual conjuncts in an EGD required to be a rel(ation/var) of the schema ?

Does the same apply to TGD's ?

If all and any individual conjuncts of a TGD are allowed to be just any arbitrary relational expression defined on the relvars, do you then think that TGD's are sufficient to express all and any possibly conceivable database constraint ?

(Sorry I have not been participating more. I said I'd love to do so, but found very little material that I could "hook onto". This is the first occasion you give me.)

com...@hotmail.com

unread,

May 30, 2013, 7:10:11 PM5/30/13

to

On Thursday, May 30, 2013 3:02:16 PM UTC-7, Erwin wrote:
> found very little material that I could "hook onto".

http://webdam.inria.fr/Alice/

philip

Jan Hidders

unread,

May 31, 2013, 4:53:20 AM5/31/13

to

Not besides the obvious connection that they generalize FDs and
therefore are connected to DKNF.

> Are the individual conjuncts in an EGD required to be a rel(ation/var)
> of the schema ?

Yes.

> Does the same apply to TGD's ?

Yes.

> If all and any individual conjuncts of a TGD are allowed to be just any
> arbitrary relational expression defined on the relvars, do you then
> think that TGD's are sufficient to express all and any possibly
> conceivable database constraint ?

Depends on how you define the language of "any arbitrary relational
expression". If that language can express any computable query, then
yes. If you mean bu that language the classical relational algbra, then
no, since you would still be within first-order logic.

-- Jan Hidders

com...@hotmail.com

unread,

Jun 1, 2013, 7:01:58 PM6/1/13

to

On Thursday, May 30, 2013 3:02:16 PM UTC-7, Erwin wrote:

Erwin,

> Are EGD's in any way related to DKNF ?

DKNF says the only constraints on a table var are key constraints. Each key constraint will be an FD from key attributes to all others. Is there a reason you think of them as otherwise related?

> Are the individual conjuncts in an EGD required to be a rel(ation/var) of the schema ?

> Does the same apply to TGD's ?

They just have to have the forms Jan showed.

> If all and any individual conjuncts of a TGD are allowed to be just any arbitrary relational expression defined on the relvars,

Do you mean EGDs-plus-TGDs? Because neither class includes the other.

A closed expression is an EGD/TGD iff it can be rewritten in the corresponding form where the Rs are table vars.

So I don't know what you are getting at by suggesting a form like that with arbitrary expressions instead of the atoms. I have the impression you have some notion of "constraining an expression", eg declaring an FD on an expression. Jan addressed this in the message you are replying to. Whatever you've written to do this is a larger containing closed expression with relvar predicates as atoms and that's the one that either is or is not in the form of (ie is or is not) one of the above dependencies. Which can also be said to "constrain the database" or "constrain its mentioned table vars" but these phrases and yours are just usages of differently-defined versions of verb "constrain" all intended be used to describe requiring that the containing closed expression hold.

Eg constraining an expression involving multiple tables to satisfy an FD turns out when rearranged to a form above to be in general a non-FD EGD. That's why FDs get replaced by EGDS when we normalize while also keeping track of sufficient constraints on components. Which is what Hugo was asking about & Jan was answering. In the student-class-mentor case your FD-constrained expression in addition to the student-class and student-mentor FDs is the join of the two post-normalization table vars and the corresponding EGD is that their projections on student are equal.

> do you then think that TGD's are sufficient to express all and any possibly conceivable database constraint ?

Not every constraint can be rewritten to these forms, ie an EGD or TGD. So no.

Eg if you rewrite the above forms in prenex nomal form (all foralls and exists at the start quantifying over only ors on the right) you will see that they limit you to certain patterns. Expressions that normalize to other patterns will be by definition of disjunctive normal form not equivalent to those that normalize to those EGD/TGD patterns.

But regarding your meaning. The constraint of having (say) two FDs hold in a var is obviously not itself the same as having some one other FD that they don't imply hold. Ie the conjunction of a bunch of those dependencies expresses yet another class of constraints and expressions. So what do you mean when you say "sufficient to express"? Probably you mean, are conjunctions of them sufficient to express every database constraint. (We normally think of a database as being constrained by a conjunction of separately expressed conjuncts that we individually call constraints.) The answer is still no.

philip