Treating addresses in 3NF?

Rich Leigh

unread,

Nov 12, 2002, 11:26:40 AM11/12/02

to

I know that in a lot (i.e. almost all?) databases, a customer list which
included an address is usually stored as CustomerID, Customer Name,
Address1, Address2, City, Post/Zip code, blah blah. But as a purely, uh,
"theoretical exercise", i.e. I don't have a bet on it with anybody at all
;), when this is normalised as far as 3NF, I come up with the tables;
Customers (CustID, CustName, AddressID)
Addresses (AddressID, Zip/Postcode)
AddressLines (SequenceNo, AddressID, Data)

I'm not entierly sure how the zipcode system works in the states, but here
in sunny Britain our quaint post code system works such that from the
postcode, you can work out a small (say, less than 20) number of dwellings
that the code applies to. The road, area and city are fully determined by
it.

The problem with embedding the address in the table is clearly that
(Address1, Postcode) --> (Address2, City), which is a dependency that
shouldn't be in the table, and needs to be normalised out to another one.

So that was exactly what I did first, but then came up against the problem
that you can easily get multi-lined addresses, and sticking in extra
columns into my new separate address table would be just so wrong that
it hurts =)

So I ended up with the situation that I have a table of singleton lines
from an address, which have a sequence number (i.e. 1 for first line, 2
for second line, etc. up to as many lines as the address needs) and are
linked to the AddressID (as I have now lost access to the house number
which would be in address line 1).

My point? Is this really "The One True And Correct Way"(tm) of storing
address information in 3NF? I know fully that nobody in their right mind
would ever consider doing it this way. That probably says quite a bit
about me, in fact. Heh. :)

Also, how would I now do a search for all customers in, pick a place...
Plymouth? select customername from customers inner join addresses A on
customers.address=addresses.id, address_lines L where A.id = L.address and
l.data = 'Plymouth'; ? Is there any way to speed this up so it doesn't
query every single line of all the addresses? Especially as I bet some
nutter would call his house "Plymouth" like, Plymouth, No 23, Some Road,
London. Would a table of Postcode->City, and then check that the Postcode
for the addressID was in the list of postcodes that matched the city work?
Or could my data can now get out of sync with itself, since the city is
also stored as a line in the table of single address lines? Aaaaagh!

--
Mmmm.
Richard Leigh <rot13>egy101 ng rpf qbg fbgba qbg np qbg hx</rot13>

Jan Hidders

unread,

Nov 12, 2002, 1:16:59 PM11/12/02

to

In article <Pine.LNX.4.44.02111...@demon.ecs.soton.ac.uk>,

Rich Leigh <rtl...@ecs.soton.ac.uk> wrote:
>I know that in a lot (i.e. almost all?) databases, a customer list which
>included an address is usually stored as CustomerID, Customer Name,
>Address1, Address2, City, Post/Zip code, blah blah. But as a purely, uh,
>"theoretical exercise", i.e. I don't have a bet on it with anybody at all
>;), when this is normalised as far as 3NF, I come up with the tables;
>Customers (CustID, CustName, AddressID)
>Addresses (AddressID, Zip/Postcode)
>AddressLines (SequenceNo, AddressID, Data)

I'm affraid you have lost the bet. :-) Just to begin with, normalization
does not involve introducing new identifiers. So let's take a look at your
relation:

Customer(ID, Name, Address1, Address2, City, Zip_code )

How do we know that the two addresses have the same city and zip code? We
probably don't so let's fix that:

Customer(ID, Name, Addr1, City1, Zip1, Addr2, City2, Zip2)

Next question: what are the candidate keys? The usual suspect is {ID} but if
two persons with the same name live at the same address then there is
probably something wrong, so {Name, Addr1, City1} and {Name, Addr2, City2}
are very likely also candidate keys. But since the zip code implies the city
we also have the candidate keys {Name, Addr1, Zip1} and {Name, Addr2, Zip2}.

So what functional dependencies are there? Since ID is the identifier we
have the trivial:

(1) ID -> Name, Addr1, City1, Zip1, Addr2, City2, Zip2

But we also know that the zip code implies the city:

(2) Zip1 -> City1
(3) Zip2 -> City2

Finally we also know that the zip code is determined by the address and
city:

(4) Addr1, City1 -> Zip1
(5) Addr2, City2 -> Zip2

That's about it. Actually we should now check if we have a minimal cover,
but we have and I won't bore you with the technical details. So what does
3NF say:

For every non-trivial FD X->A with A a non-key attribute it holds that X
is a candidate key.

So do we violate the 3NF? For (1) it is easy to see that it doesn't since X
= {ID} and a candidate key. For (2) (3) (4) and (5) this is also easy to see
because City1, City2, Zip1 and Zip2 are all part of a candidate key and
therefore key attributes and the left-hand sides (the X in X->A) do not have
to be candidate keys.

So we are already in 3NF and nothing needs to be done.

But what if we allow people with the same name at the same address? Let's
even assume that it is possible that two persons with the same name live at
the same two addresses. (Otherwise you are still in 3NF.) In this case (2),
(3), (4) and (5) all violate 3NF so we will have to split off some of these.
The best choice for this is (4) and (5) because if you split off (2) and (3)
the dependencies (4) and (5) get distributed over two tables and we don't
want that. So you get:

Customer(ID, Name, Addr1, City1, Addr2, City2)
Zip1(Addr1, City1, Zip1)
Zip2(Addr1, City1, Zip1)

Since Zip1 and Zip2 are simply subsets of the zip-code book they can be
joined into one relation:

Customer(ID, Name, Addr1, City1, Addr2, City2)
Zip(Addr, City, Zip)

So are we done now? We still have to check if the new relation is in 3NF.
What are the candidate keys? It's just {Addr, City}. Are there any FDs? Yes,
we have one that goes with the candidate key:

(4') Addr, City -> Zip

This is clearly not a problem in 3NF because {Addr, City} is the candidate
key. Then there is the one that says that the zip code determines the city:

(3') Zip -> City

and this is also not a problem because City is in the candidate key and
therefore a key attribute. So we have reached now 3NF.

-- Jan Hidders

Paul Vernon

unread,

Nov 12, 2002, 12:51:32 PM11/12/02

to

You mean to tell me that a prestigious, highly efficient organisation like the
British Royal Mail does not provide everyone with a fully normalised
relational model with their postcode and address data, but rather gives those
willing to *pay* for the data, a set of 'flat files' and some Cobol copybook
definitions, so forcing us developers to (try to) create normalised models for
their data?! No, surly not.

Surly what happens is that such prestigious organisations that 'own' data
vital to the smooth running of the economy subscribe to an international
standard on the exchange of data (that is obviously based on the relational
model, being as it is, the 'best' way of logically representing data for
multiple applications) and publish their data using such a standard, which
(amongst many other things) would allow all databases that use this data to
work off the same logical model

:-)

Regards
Paul Vernon
Business Intelligence, IBM Global Services

Paul

unread,

Nov 13, 2002, 8:24:32 AM11/13/02

to

hid...@hcoss.uia.ac.be (Jan Hidders) wrote in message news:<3dd1459a$1...@news.uia.ac.be>...

> In article <Pine.LNX.4.44.02111...@demon.ecs.soton.ac.uk>,
> Rich Leigh <rtl...@ecs.soton.ac.uk> wrote:
> >I know that in a lot (i.e. almost all?) databases, a customer list which
> >included an address is usually stored as CustomerID, Customer Name,
> >Address1, Address2, City, Post/Zip code, blah blah. But as a purely, uh,
> >"theoretical exercise", i.e. I don't have a bet on it with anybody at all
> >;), when this is normalised as far as 3NF, I come up with the tables;
> >Customers (CustID, CustName, AddressID)
> >Addresses (AddressID, Zip/Postcode)
> >AddressLines (SequenceNo, AddressID, Data)
>
> I'm affraid you have lost the bet. :-) Just to begin with, normalization
> does not involve introducing new identifiers. So let's take a look at your
> relation:
>
> Customer(ID, Name, Address1, Address2, City, Zip_code )
>
> How do we know that the two addresses have the same city and zip code? We
> probably don't so let's fix that:
>
> Customer(ID, Name, Addr1, City1, Zip1, Addr2, City2, Zip2)

I think this might be a misinterpretation of the original post.
I think what was meant by Address1, Address2 is the separate address
*lines* of the address e.g.:

Address1: 123 Acacia Avenue
Address2: Anytown
City: London
PostCode: SW12 45T

the problem is that you can get addresses like:

Flat 12,
"The Gables",
3 Leafy Ct,
High St,
AnyVillage,
AnyTown,
AnyCity,
Countyshire,
UK.
AB12 34C

though I think the Post Office does have the concept of a canonical
form for an address to help get round this problem (e.g. you leave out
the village since the postcode can identify it). So I don't think you
need to allow for the possibility of an arbitrary number of address
lines. Although I guess it depends if you are cleaning the data before
it goes in the database or allowing address entry as the person wants
and cleaning it later.

Now AB12 will always be in the county of Countyshire.
and AB12 34C will be a group of about 10? houses (usually?) in the
same road, maybe with sequential house numbers but maybe not as well.
I think the combination of postcode and house number (or name if it
doesn't have a number) uniquely identifies the address. Though
sometimes I think companies in a single building with large amounts of
mail have their own unique postcode or maybe even multiple postcodes
(for different departments).

So the pair (postcode, house name or number) should in theory uniquely
identify the addess. Maybe actually you need the flat number in there
as well to allow for multiple addresses in the same building.

Another problem is that the counties used by the Post Office are
different to the counties used for administrative purposes. e.g.
someone in Greater London could be in the (non-existent) county of
Middlesex for postal reasons...

Paul.

Paul Vernon

unread,

Nov 13, 2002, 9:51:31 AM11/13/02

to

"Paul" <pbra...@cosmos-uk.co.uk> wrote in message
news:51d64140.02111...@posting.google.com...

>
> Now AB12 will always be in the county of Countyshire.

Well 'AB12' -> POSTAL_TOWN
and POST_TOWN -> COUNTY
and COUNTY -> COUNTRY

but AB12 does not impliy that your house is actually in COUNTY, nor for that
matter COUNTRY.

E.g. Search for 'HOME FARM' in 'SY13' on the address finder on
www.royalmail.com and, although they do not show COUNTRY, I know that half of
those farms are in Wales and half in England!

If anyone is serious about UK postal addresses, you need to digest:

http://www.royalmail.com/docContent/other/Downloadable_Files/PAF_Digest_Issue_
5_0.pdf

And try to find all the hidden functional dependencies :-)

Then try the same for the 200 or so other other postal address 'standards'
around the world.

neil

unread,

Nov 14, 2002, 3:20:46 PM11/14/02

to

"Rich Leigh" <rtl...@ecs.soton.ac.uk> wrote in message
news:Pine.LNX.4.44.02111...@demon.ecs.soton.ac.uk...

> I know that in a lot (i.e. almost all?) databases, a customer list which
> included an address is usually stored as CustomerID, Customer Name,
> Address1, Address2, City, Post/Zip code, blah blah. But as a purely, uh,
> "theoretical exercise", i.e. I don't have a bet on it with anybody at all
> ;), when this is normalised as far as 3NF, I come up with the tables;
> Customers (CustID, CustName, AddressID)
> Addresses (AddressID, Zip/Postcode)
> AddressLines (SequenceNo, AddressID, Data)

I have a less erudite comment than the other magnificent posts (I have
printed Jan's out to fully appreciate it later).

AddressLines.Data is a field that can apparently contain address lines or
City. This means that the domain has to be a simple string data type.
Following on from this logic, you only need one table to express the whole
address. I think it is better to keep the address lines and city in separate
domains and in that case they must be in separate columns to be in 1NF.

A different point. I think the main reason to store vague column names like
Address1 and Address2 is to be able to count lines on labels and restrict
input to a certain line width. I use bigger labels and allow a single
Address field to have returns embedded in them.

Mike Sherrill

unread,

Nov 15, 2002, 10:12:24 AM11/15/02

to

On Tue, 12 Nov 2002 16:26:40 +0000, Rich Leigh
<rtl...@ecs.soton.ac.uk> wrote:

>I know that in a lot (i.e. almost all?) databases, a customer list which
>included an address is usually stored as CustomerID, Customer Name,
>Address1, Address2, City, Post/Zip code, blah blah.

Surely not "Customer Name", but "Addressee"? They're logically
distinct attributes that sometimes (often?) have the same value.

>But as a purely, uh,
>"theoretical exercise", i.e. I don't have a bet on it with anybody at all
>;), when this is normalised as far as 3NF, I come up with the tables;
>Customers (CustID, CustName, AddressID)
>Addresses (AddressID, Zip/Postcode)
>AddressLines (SequenceNo, AddressID, Data)

Hmmm. Maybe it would help to stop thinking about dependencies for a
little while, and think about the simple facts represented by an
address. Work on a conceptual model.

--
Mike Sherrill
Information Management Systems

David Cressey

unread,

Nov 16, 2002, 11:36:31 AM11/16/02

to

> So let's take a look at your
> relation:
>
> Customer(ID, Name, Address1, Address2, City, Zip_code )
>
> How do we know that the two addresses have the same city and zip code? We
> probably don't so let's fix that:
>

PMFJI. Are you sure that Address1 and Address2 are not two lines on a
single address?
For example:

<address1>Suite 503
<address2>1125 Wisconsin Ave.

--CELKO--

unread,

Nov 16, 2002, 9:40:19 PM11/16/02

to

>> You mean to tell me that a prestigious, highly efficient
organisation like the British Royal Mail does not provide everyone

with a fully normalised relational model ... <<

For 'UK postcode databases', the Royal Mail holds the copyright and
will licence you an 'outwards' database (i.e. the first segment of the
full postcode) for about 150 UK pounds a year, or an 'inwards'
database (i.e. the full 9- or 10-character postcode) for about 500 UK
pounds per year.

The U.S. zipcodes are nested geographical areas of the form
abbbcc-ddee. The first digit is a multi-state region, the next three
are part of a state, the next two are postal districts within a city
or county.

Sometimes a zipcode belongs to a single entity, like a hospital,
government agency office, university, etc.

The stuff after the dash are trickier:

1) The first two "zip+4" digits are the postal carrier route and the
last two are neighborhoods within the route. The idea is to phsyical
sort the mail for easy delivery.

2) If the zipcode is a post office, the last four digits are all or
part of a box number.

zip -> city
zip -> state
(zip, zip+4) ->> (street, city, state)

Jan Hidders

unread,

Nov 17, 2002, 5:49:03 PM11/17/02

to

David Cressey wrote:
>> So let's take a look at your
>> relation:
>>
>> Customer(ID, Name, Address1, Address2, City, Zip_code )
>>
>> How do we know that the two addresses have the same city and zip code? We
>> probably don't so let's fix that:
>>
>
>PMFJI. Are you sure that Address1 and Address2 are not two lines on a
>single address?

My mistake, I had forgotten that British addresses are represented different
than ours. Actually, there is also another error in the normalization
process I described. There's bonus if you find it. ;-)

-- Jan Hidders

Paul Vernon

unread,

Nov 17, 2002, 5:38:05 PM11/17/02

to

"--CELKO--" <71062...@compuserve.com> wrote in message
news:c0d87ec0.02111...@posting.google.com...

> >> You mean to tell me that a prestigious, highly efficient
> organisation like the British Royal Mail does not provide everyone
> with a fully normalised relational model ... <<
>
> For 'UK postcode databases', the Royal Mail holds the copyright and
> will licence you an 'outwards' database (i.e. the first segment of the
> full postcode) for about 150 UK pounds a year, or an 'inwards'
> database (i.e. the full 9- or 10-character postcode) for about 500 UK
> pounds per year.

Thanks Joe, although my point is that you don't get a relational model for
your 150 or 500 pounds. Why is that? Why no international standard on
relational database model descriptions?

Also, it always occurs to me that it must cost the Royal Mail much more
because of the inefficiency due to baldy addressed post than they could ever
make from selling their postcodes. I.e. if the postcode database was free,
more post would be address correctly and they would *save* money. But then I
should not expect to see much sense from a publicly owned organisation.

Theo Peterbroers

unread,

Nov 18, 2002, 4:16:25 AM11/18/02

to

"Paul Vernon" <paul....@ukk.ibmm.comm> wrote in message news:<ar95q6$lkg$3...@sp15at20.hursley.ibm.com>...

> "--CELKO--" <71062...@compuserve.com> wrote in message
> news:c0d87ec0.02111...@posting.google.com...
> > >> You mean to tell me that a prestigious, highly efficient
> > organisation like the British Royal Mail does not provide everyone
> > with a fully normalised relational model ... <<

<snip>

> Thanks Joe, although my point is that you don't get a relational model for
> your 150 or 500 pounds. Why is that? Why no international standard on
> relational database model descriptions?

<snip>

> Regards
> Paul Vernon
> Business Intelligence, IBM Global Services

There is an international standard under way:
http://www.addressstandard.com/

It may be more 'business oriented' rather than 'database oriented' though.