Normalization?

123 views
Skip to first unread message

Alex Cruise

unread,
Jul 23, 2009, 1:26:08 PM7/23/09
to Functional Relational Programming
Hi folks! Get out the AED, this group's not quite dead yet. :)

I'm playing with implementing a toy FRelP system ("FRP" has been taken
over--damn you Conal Elliott! ;) and am curious about the repeated use
of "address" fields (aliased for string) as candidate/foreign keys in
most of the relations in the real estate example.

In the modern mainstream version of relational dogma this raises red
flags; we're supposed to refactor the repeated data into a combination
of a new relation and some number of foreign keys pointing back to
it.

Is it an intention of FRelP that there be no arbitrarily assigned
identifiers, such as the primary/foreign keys we typically use today?
It somehow seems wrong to use a candidate key (i.e. some data that
happens to be unique) as a foreign key; surely a more direct reference
to the Property relation is called for?

My copy of The Third Manifesto is in the mail, maybe it will turn me
around completely. :)

Thanks!

-0xe1a

Ben Moseley

unread,
Aug 8, 2009, 3:38:14 AM8/8/09
to frp-d...@googlegroups.com
Hi Alex,

On 23 Jul 2009, at 18:26, Alex Cruise wrote:
> I'm playing with implementing a toy FRelP system

Cool!

> ("FRP" has been taken
> over--damn you Conal Elliott! ;)

To be fair, Functional Reactive Programming did actually come first !

> and am curious about the repeated use
> of "address" fields (aliased for string) as candidate/foreign keys in
> most of the relations in the real estate example.
>
> In the modern mainstream version of relational dogma this raises red
> flags; we're supposed to refactor the repeated data into a combination
> of a new relation and some number of foreign keys pointing back to
> it.

"refactor the repeated data" - I think it's interesting to dig into
what you mean here.

You probably have in mind something like an "AddressID" which might be
some integer or another and using that instead of the "address" type
(alias) in the relvars of the example, maybe along with another relvar
to record (AddressID, String).

One thing to consider though is that what you've done is to replace
one piece of repeated data with another (albeit 'smaller') one.

As an aside, I've often heard people refer to designs such as this
example as "unnormalized" which is inaccurate (all normalizations are
defined with respect to some set of functional dependencies and we're
not talking about any fundeps here).

I think that this issue arises because genuine lack of normalization
leads to a database in which fewer single-tuple updates are logically
acceptable (you'd need to update all functionally dependent values
simultaneously to avoid violating the fundep), and (I believe) you're
pointing out that you couldn't "change" an address in a single tuple
without violating constraints.

My feeling is that the similarities between these two situations are
however largely superficial. In the former we're concerned with
information represented by the relations themselves (if we think
formally then relations over uninterpreted symbols). In the latter
you're thinking about information "represented" by a value itself (ie
an address in String form). I would argue that this latter kind of
thing is strictly outside the relational model.

Clearly there are some practical issues here, and for some time I've
wondered whether it would make sense to have some kind of "stratified"
system whereby lower tiers could define universes of values over which
higher tiers would define relations. In such a system "address" would
be defined in a base "tier", and the relvars in the example would all
live in a higher tier. Such a system would need to provide
increasingly tight restrictions on updates to lower tiers, as any
changes to a given tier could potentially invalidate arbitrary
constraints at higher tiers.

> Is it an intention of FRelP that there be no arbitrarily assigned
> identifiers, such as the primary/foreign keys we typically use today?

It is certainly my feeling that systems shouldn't impose arbitrarily
assigned identifiers on users /unless/ they actually add value to the
user interface (vehicle registrations, social security numbers) etc.

I also think that often the underlying reasons for using such
arbitrary identifiers stem from various technical limitations in
current systems.

> It somehow seems wrong to use a candidate key (i.e. some data that
> happens to be unique) as a foreign key;


I don't at all see why. This seems to me to be a natural way to model
many things.

Maybe you could expand on your concerns?

--Ben

Ben Moseley

unread,
Aug 8, 2009, 3:40:23 AM8/8/09
to frp-d...@googlegroups.com

On 23 Jul 2009, at 18:26, Alex Cruise wrote:
> I'm playing with implementing a toy FRelP system...

I'd be interested in collaborating on developing an open source FRP
system in Haskell - let me know if you'd be interested in being
involved in that.

Cheers,

--Ben

Alex Cruise

unread,
Aug 8, 2009, 9:44:17 AM8/8/09
to Functional Relational Programming
On Aug 8, 3:40 am, Ben Moseley <mosel...@googlemail.com> wrote:
> I'd be interested in collaborating on developing an open source FRP  
> system in Haskell - let me know if you'd be interested in being  
> involved in that.

Mine's in Scala, so it's even better! ;)

-0xe1a

Alex Cruise

unread,
Aug 9, 2009, 11:03:18 PM8/9/09
to Functional Relational Programming
On Aug 8, 3:38 am, Ben Moseley <mosel...@googlemail.com> wrote:
> On 23 Jul 2009, at 18:26, Alex Cruise wrote:
> > In the modern mainstream version of relational dogma this raises red
> > flags; we're supposed to refactor the repeated data into a combination
> > of a new  relation and some number of foreign keys pointing back to
> > it.
>
> "refactor the repeated data" - I think it's interesting to dig into  
> what you mean here.
>
> You probably have in mind something like an "AddressID" which might be  
> some integer or another and using that instead of the "address" type  
> (alias) in the relvars of the example, maybe along with another relvar  
> to record (AddressID, String).

Actually, my other question with respect arbitrary identifiers is an
indication of my desire to avoid the ID as well. It seems to me that
there should be a concept of first-class, *typed* references to rows.
For instance, the system shouldn't let you store a PropertyRef value
in a RoomRef attribute, even though their internal representation
might be the identical type.

> I think that this issue arises because genuine lack of normalization  
> leads to a database in which fewer single-tuple updates are logically  
> acceptable (you'd need to update all functionally dependent values  
> simultaneously to avoid violating the fundep), and (I believe) you're  
> pointing out that you couldn't "change" an address in a single tuple  
> without violating constraints.

It's possible that it's the choice of the address attribute that's at
the root of my unease, but I don't think so. An address has obvious
semantic content to humans ("keys with business meaning"), is very
likely to be subject to edits, and is quite likely to become non-
unique for business reasons (e.g. multiple buyers and/or sellers at
the same address), and is even somewhat likely to be broken apart into
separate attributes (e.g. house number, street name...).

To my mind, the half-typed foreign key values that are in current wide
use are a necessary (but not sufficient) layer of abstraction between
tuple values, and references to them. I think it's a very desirable
property of normalized systems that one can change most aspects of a
relation (e.g. data representation, constraints, etc.) without
modifying any of the relations that refer to it.

> Clearly there are some practical issues here, and for some time I've  
> wondered whether it would make sense to have some kind of "stratified"  
> system whereby lower tiers could define universes of values over which  
> higher tiers would define relations. In such a system "address" would  
> be defined in a base "tier", and the relvars in the example would all  
> live in a higher tier. Such a system would need to provide  
> increasingly tight restrictions on updates to lower tiers, as any  
> changes to a given tier could potentially invalidate arbitrary  
> constraints at higher tiers.

It's an interesting thought, but I don't think Address is a
qualitatively different kind of thing from Property, it just happens
to be used by it, and several other relations. If you could identify
some useful abstractions over relations I think those might be a good
starting point for thinking about an addition to the type system.

> I also think that often the underlying reasons for using such  
> arbitrary identifiers stem from various technical limitations in  
> current systems.

I agree, but I strongly believe that first-class references are a
better choice than making direct use of candidate key values, whether
they're human-readable or generated.

> > It somehow seems wrong to use a candidate key (i.e. some data that
> > happens to be unique) as a foreign key;
> I don't at all see why. This seems to me to be a natural way to model  
> many things.

I think that from an intuitive standpoint it makes sense to make use
of unique candidate/foreign key values with business meaning, but I
still buy the argument against them, which I think has come about much
more from bitter experience than theoretical concerns.

Thanks,

-0xe1a

Ben Moseley

unread,
Aug 16, 2009, 3:21:08 AM8/16/09
to frp-d...@googlegroups.com
On 10 Aug 2009, at 04:03, Alex Cruise wrote:

Actually, my other question with respect arbitrary identifiers is an
indication of my desire to avoid the ID as well.  It seems to me that
there should be a concept of first-class, *typed* references to rows.
For instance, the system shouldn't let you store a PropertyRef value
in a RoomRef attribute, even though their internal representation
might be the identical type.

My first reaction is that this sounds like re-inventing something vaguely OO-ish. Personally I'm not at all keen on "Ref" types or other forms of explicit reference - for me one of the major strengths of the relational model is that it leaves relationships implicit, and anyone is free to use any relationships which are of use - irrespective of which attributes they use to form them.


I think that this issue arises because genuine lack of normalization  
leads to a database in which fewer single-tuple updates are logically  
acceptable (you'd need to update all functionally dependent values  
simultaneously to avoid violating the fundep), and (I believe) you're  
pointing out that you couldn't "change" an address in a single tuple  
without violating constraints.

It's possible that it's the choice of the address attribute that's at
the root of my unease, but I don't think so.  An address has obvious
semantic content to humans ("keys with business meaning"),
is very
likely to be subject to edits, and is

quite likely to become non-
unique for business reasons (e.g. multiple buyers and/or sellers at
the same address),

That is a separate issue that would be represented with separate relations. I think you can still say that the address itself is still "unique".

and is even somewhat likely to be broken apart into
separate attributes (e.g. house number, street name...).

Ultimately this depends on what you're trying to do. It would make perfect logical sense to have an "AddressComponents" relation with attributes: (a :: address, houseNumber :: Int, street :: String) - this only seems weird when you stop thinking of the attribute values as being atomic. Still, I think that some kind of a stratified system might make sense.

To my mind, the half-typed foreign key values that are in current wide
use are a necessary (but not sufficient) layer of abstraction between
tuple values, and references to them.

I don't personally see any need for any explicit concept of references - and hence I don't see any need for any kind of abstraction of this kind.

I think it's a very desirable
property of normalized systems that one can change most aspects of a
relation (e.g. data representation, constraints, etc.) without
modifying any of the relations that refer to it.

In general integrity constraints can refer to any relations - hence in general you can't expect to be able to update an arbitrary relvar without needing to change others too - indeed you wouldn't want that to be possible.


Clearly there are some practical issues here, and for some time I've  
wondered whether it would make sense to have some kind of "stratified"  
system whereby lower tiers could define universes of values over which  
higher tiers would define relations. In such a system "address" would  
be defined in a base "tier", and the relvars in the example would all  
live in a higher tier. Such a system would need to provide  
increasingly tight restrictions on updates to lower tiers, as any  
changes to a given tier could potentially invalidate arbitrary  
constraints at higher tiers.

It's an interesting thought, but I don't think Address is a
qualitatively different kind of thing from Property, it just happens
to be used by it, and several other relations.  If you could identify
some useful abstractions over relations I think those might be a good
starting point for thinking about an addition to the type system.

To be honest, none of this stuff really counts as a "type system" yet - it really needs to be formalized properly - but anyway it's not a matter of being "qualitatively different" but simply that you need to have some base set of values before you can meaningfully talk about relations over them, and if you want to talk about some kind structure internal to those values (possibly even admitting some concept of "change" over time) then that can be done, and can be done in a relational way, but if anything about those values 'changes' in some sense then all bets are off wrt any relations you had previously constructed out of them. Hence my thought about introducing a strictly stratified system as a way to do this kind of thing safely.


I also think that often the underlying reasons for using such  
arbitrary identifiers stem from various technical limitations in  
current systems.

I agree, but I strongly believe that first-class references are a
better choice than making direct use of candidate key values, whether
they're human-readable or generated.

I personally believe that first-class references have no place in the relational model (at least what most people mean when they say "first class references").


It somehow seems wrong to use a candidate key (i.e. some data that
happens to be unique) as a foreign key;
I don't at all see why. This seems to me to be a natural way to model  
many things.

I think that from an intuitive standpoint it makes sense to make use
of unique candidate/foreign key values with business meaning, but I
still buy the argument against them, which I think has come about much
more from bitter experience than theoretical concerns.

I think it's important to be clear about the context we're talking in.  I'm talking about what I think makes logical sense and could in-principle sensibly be supported by a suitable software stack. What does or doesn't make sense (or leads to bitter experience) given some particular existing set of technologies is a very different question !

--Ben

Reply all
Reply to author
Forward
0 new messages