How to model disjoint date ranges?

Paul

unread,

Jul 17, 2002, 5:01:17 AM7/17/02

to

I have an attribute that changes with time in discrete steps e.g.
if something has a title that can change every now and then and we
need to store the history of previous titles.

Currently the table is like:

CREATE TABLE foo (
titleid int,
startdate datetime,
title varchar(20),
CONSTRAINT PK_foo PRIMARY KEY (titleid, startdate)
)

INSERT INTO foo VALUES(1, '1900.01.01', 'AAA')
INSERT INTO foo VALUES(1, '2000.03.05', 'BBB')
INSERT INTO foo VALUES(1, '2001.05.22', 'CCC')

but I don't think this is normalised because in order to get the full
information about the row you need to look at the other rows in the
table with the same titleid to get the end of the date range. Also
when joining this table a correlated subquery is needed. like:

SELECT * FROM bar
JOIN foo
ON foo.titleid = bar.titleid
AND foo.startdate = (SELECT Max(startdate) FROM foo
WHERE titleid = foo.titleid
AND startdate <= bar.actualdate)

Next thought is this:

CREATE TABLE foo (
titleid int,
startdate datetime,
enddate datetime,
title varchar(20)
)

INSERT INTO foo VALUES(1, NULL, '2000.03.04', 'AAA')
INSERT INTO foo VALUES(1, '2000.03.05', '2001.05.21', 'BBB')
INSERT INTO foo VALUES(1, '2001.05.22', NULL, 'CCC')
INSERT INTO foo VALUES(2, NULL, '2002.10.24', 'ZZZ')
INSERT INTO foo VALUES(2, '2002.10.25', NULL, 'YYY')

with maybe some constraint that forces the date ranges to be disjoint.
But somehow this doesn't feel quite right either. Also we might have
to have low and high values instead of NULLs for the dates if we
define a primary key.

I'm not sure if it's a deficiency of my particular DBMS (SQL Server)
but foreign keys can only be defined that are standard joins i.e. not
"theta joins" (is this the right terminology?)

Is there a ideal way to do this?
If not, what are the relative merits of the options?

Paul.

--CELKO--

unread,

Jul 21, 2002, 5:22:54 PM7/21/02

to

>> I have an attribute that changes with time in discrete steps e.g.
if something has a title that can change every now and then and we
need to store the history of previous titles. <<

I prefer to do this with (start_date, end_date) pairs. The logic is
that the duration is part of the fact that you are modeling. The pair
is really like an (x,y) co-ordinate -- it has two parts in its
representation, but it is a single atomic attribute.

CREATE TABLE Foobar
(foo_key INTEGER NOT NULL PRIMAREY KEY,
...
start_date TIMESTAMP NOT NULL,
end_date TIMESTAMP, --null means still on-going
...);

When you do queries, the NULL in the end_date gives you what you need
to always have the right facts:

To only the things still open:

CREATE VIEW CurrentFoobar (..., start_date, end_date, ..)
AS SELECT ... start_date, CURRENT_TIMESTAMP, ...
FROM Foobar
WHERE end_date IS NULL;

To see the most recent examples, including those that are closed:

CREATE VIEW MostRecentFoobar (..., start_date, end_date, ..)
AS SELECT ... start_date, COALESCE(end_date,CURRENT_TIMESTAMP), ...
FROM Foobar
WHERE end_date IS NULL
OR end_date
= (SELECT MAX(endate)
FROM Foobar AS F1
WHERE F1.foo_key = Foobar.foo_key);

You get the idea.

Larry Coon

unread,

Jul 22, 2002, 12:19:56 PM7/22/02

to

--CELKO-- wrote:

> I prefer to do this with (start_date, end_date) pairs. The logic is
> that the duration is part of the fact that you are modeling. The pair
> is really like an (x,y) co-ordinate -- it has two parts in its
> representation, but it is a single atomic attribute.

I've been discussing this with him over e-mail, and one of
the points is that this depends on whether the intervals
allow gaps or overlaps. If gaps or overlaps are allowed,
then storing start_date & end_date is right. But if there
are no gaps or overlaps, then storing start_date AND
end_date means the same fact is being stored multiple times
in your data. "This" interval ends when the next interval
begins, and the begin date of the next interval is already
stored.

> CREATE TABLE Foobar
> (foo_key INTEGER NOT NULL PRIMAREY KEY,
> ...
> start_date TIMESTAMP NOT NULL,
> end_date TIMESTAMP, --null means still on-going
> ...);

In our data (university curriculum), you have courses with
titles, unit values, repeatability, transferability, etc.,
independently changing over time. So the table to track
title changes might look like:

CREATE TABLE course_title (
course_id integer not null,
title varchar(255),
start_date datetime,
end_date datetime,
foreign key (course_id) references course(course_id),
primary key (course_id, start_date)
)

But with this schema, it's easy to generate data anomalies:

INSERT INTO course_title
VALUES (12345, "Beginning Algebra", "10/1/95", "10/1/01")

INSERT INTO course_title
VALUES (12345, "Algebra I", "10/1/99", "10/1/03")

But we did find that queries are easier to write when you
have both the start date and the end date....

> When you do queries, the NULL in the end_date gives you what you need
> to always have the right facts:
>
> To only the things still open:
>
> CREATE VIEW CurrentFoobar (..., start_date, end_date, ..)
> AS SELECT ... start_date, CURRENT_TIMESTAMP, ...
> FROM Foobar
> WHERE end_date IS NULL;
>
> To see the most recent examples, including those that are closed:
>
> CREATE VIEW MostRecentFoobar (..., start_date, end_date, ..)
> AS SELECT ... start_date, COALESCE(end_date,CURRENT_TIMESTAMP), ...
> FROM Foobar
> WHERE end_date IS NULL
> OR end_date
> = (SELECT MAX(endate)
> FROM Foobar AS F1
> WHERE F1.foo_key = Foobar.foo_key);
>
> You get the idea.

We ended up with sentinel dates rather than nulls, so
comparison operators will continue to work with them, which
is not true of nulls.

To see the courses that were active on 10/1/01:

select ...
from course
where start_date <= '10/1/01'
and end_date > '10/1/01'

Larry Coon
University of California
la...@assist.org
and lmc...@home.com

Steve Kass

unread,

Jul 22, 2002, 7:12:13 PM7/22/02

to

Larry,

If you want to see more than the tip of the iceberg here, read
"Developing Time-Oriented Database Applications in SQL,"
by Richard T. Snodgrass. It's a tremendous book.

Steve Kass
Drew University

--CELKO--

unread,

Jul 23, 2002, 3:12:01 PM7/23/02

to

>> I've been discussing this with him over e-mail, and one of the
points is that this depends on whether the intervals allow gaps or
overlaps. If gaps or overlaps are allowed, then storing start_date &
end_date is right. But if there
are no gaps or overlaps, then storing start_date AND end_date means
the same fact is being stored multiple times in your data. "This"
interval ends when the next interval begins, and the begin date of the
next interval is already
stored. <<

I disagree. The duration is part of an atomic fact; you are trying to
find your end point in another row, not in the row where it belongs.

CREATE TABLE Events
(event_name CHAR(5) NOT NULL PRIMARY KEY,
start_date DATE NOT NULL);

INSERT INTO Events VALUES
('a', '2002-01-01'),
('b', '2002-01-10'),
('c', '2002-01-20');

Which I would show as:

CREATE TABLE Events
(event_name CHAR(5) NOT NULL PRIMARY KEY,
start_date DATE NOT NULL,
end_date DATE);

INSERT INTO Events VALUES
('a', '2002-01-01', '2002-01-09'),
('b', '2002-01-10', '2002-01-19'),
('c', '2002-01-20', NULL);

Now, let's do this:

DELETE FROM Events
WHERE event = 'b';

You have a deletion anomaly that creates the falsehood that we had a
business rule to close up the gaps:

('a', '2002-01-01', '2002-01-19'),
('c', '2002-01-20', NULL);

>> > We ended up with sentinel dates rather than nulls, so comparison
operators will continue to work with them, which is not true of nulls.
<<

Sentinel dates do not work! Not every SQL engine takes '9999-12-31'
so the code blows up when you port it. It is also a lie; that
sentinel dates is a real date and you will be checking for all over
your code,unable to tell true from sentinel dates. NULLs were
designed for missing data, so use them. I wish we had an ETERNITY
token in SQL, but it is too late for that.

Oh, minor point, ISO-8601 is the ONLT temporal format allowed in
Standard SQL and string are enclosed in single quotes, not double
quotes.

>> But with this schema, it's easy to generate data anomalies: <<

If you have a business rule that you do not allow gaps, then enforce
it with a CHECK() constraint or a CREATE ASSERTION statement.

CHECK(
NOT EXISTS
(SELECT *
FROM Events AS E1
GROUP BY event
HAVING MAX(end_date) - MIN(start_date)
> SUM(end_date - start_date +1)))

(Pardon that I did not do the full SQL-92 syntax for the date math)
Likewise you can do one for preventing overlaps

CHECK(
NOT EXISTS
(SELECT *
FROM Events AS E1, Events AS E2
WHERE E1.event = E2.event
AND E1.start_date BETWEEN E2.start_date AND E2.end_date))

I think I got those right, but I did not check my work.

--CELKO--

unread,

Jul 23, 2002, 3:33:48 PM7/23/02

to

>> I've been discussing this with him over e-mail, and one of the
points is that this depends on whether the intervals allow gaps or
overlaps. If gaps or overlaps are allowed, then storing start_date &
end_date is right. But if there
are no gaps or overlaps, then storing start_date AND end_date means
the same fact is being stored multiple times in your data. "This"
interval ends when the next interval begins, and the begin date of the
next interval is already
stored. <<

The complete fact is the duration, not just the start of it. The
begin date for the "next" (whatever that means in a set oriented
model) interval is part of that fact. Consider this:

CREATE TABLE Events
(event CHAR(5) INTEGER NOT NULL PRIMARY KEY,
start_date TIMESTAMP NOT NULL);

INSERT INTO Events

('a', '2002-01-01'),
('b', '2002-01-10'),
('c', '2002-01-20');

Which I would show as

CREATE TABLE Events
(event CHAR(5) INTEGER NOT NULL PRIMARY KEY,

start_date TIMESTAMP NOT NULL,
end_date TIMESTAMP,

CHECK (<no gaps>),
CHECK (<no overlaps>));

INSERT INTO Events

('a', '2002-01-01', '2002-01-09'),
('b', '2002-01-10', '2002-01-19'),
('c', '2002-01-20', NULL);

Since your data model is wrong, you have no way of adding the two
constraints.

Now do

DELETE FROM Events
WHERE event = 'b';

Your model will, in effect, do this to the table:

('a', '2002-01-01', '2002-01-19'), <== changed!

('c', '2002-01-20', NULL);

Where did that rule come from? Likewise, if I wanted to insert a new
event, 'b1', on '2002-01-17', I would, in effect, have:

('a', '2002-01-01', '2002-01-16'), <== changed!
('b', '2002-01-17', '2002-01-19'),

('c', '2002-01-20', NULL);

I think that demonstates anomalies.

>> We ended up with sentinel dates rather than nulls, so comparison
operators will continue to work with them, which is not true of nulls.
<<

Use COALESCE() with NULL and you are fine. The sentinel dates are a
mess; which one do you use? The particular max date for your
particular product? A lot of products blow up on '999-12-31', the
highest date allowed in ISO-8601. Then you have to embed that
sentinel date everywhere in your code -- a bad programming practice.

We put NULLs in SQL for just this reason; use them.

SQL-92 has an OVERLAPS predicate for testing durations, so that one is
easy write.

CONSTRAINT no_gaps

CHECK (NOT EXISTS
(SELECT *
FROM Events AS E1

GROUP BY E1.event
HAVING (MIN(start_date) - MAX(end_date)) +1
> SUM(start-date - end_date +1)))

--CELKO--

unread,

Jul 23, 2002, 3:44:20 PM7/23/02

to

>> I've been discussing this with him over e-mail, and one of the
points is that this depends on whether the intervals allow gaps or
overlaps. If gaps or overlaps are allowed, then storing start_date &
end_date is right. But if there
are no gaps or overlaps, then storing start_date AND end_date means
the same fact is being stored multiple times in your data. "This"
interval ends when the next interval begins, and the begin date of the
next interval is already
stored. <<

The complete fact is the duration, not just the start of it. The

begin date for the "next" (whatever that means in a set oriented
model) interval is part of that fact. Consider this:

CREATE TABLE Events
(event CHAR(5) INTEGER NOT NULL PRIMARY KEY,
start_date TIMESTAMP NOT NULL);

INSERT INTO Events
('a', '2002-01-01'),
('b', '2002-01-10'),
('c', '2002-01-20');

Which I would show as

CREATE TABLE Events
(event CHAR(5) INTEGER NOT NULL PRIMARY KEY,

start_date TIMESTAMP NOT NULL,
end_date TIMESTAMP,

CHECK (<no gaps>),
CHECK (<no overlaps>));

INSERT INTO Events
('a', '2002-01-01', '2002-01-09'),
('b', '2002-01-10', '2002-01-19'),
('c', '2002-01-20', NULL);

Since your data model is wrong, you have no way of adding the two
constraints.

Now do

DELETE FROM Events
WHERE event = 'b';

Your model will, in effect, do this to the table:

('a', '2002-01-01', '2002-01-19'), <== changed!
('c', '2002-01-20', NULL);

Where did that rule come from? Likewise, if I wanted to insert a new
event, 'b1', on '2002-01-17', I would, in effect, have:

('a', '2002-01-01', '2002-01-16'), <== changed!
('b', '2002-01-17', '2002-01-19'),
('c', '2002-01-20', NULL);

I think that demonstates anomalies.

>> We ended up with sentinel dates rather than nulls, so comparison

operators will continue to work with them, which is not true of nulls.
<<

Use COALESCE() with NULL and you are fine. The sentinel dates are a