Analytics question maybe - combine over lapping date rows

stephen O'D

unread,

Jun 27, 2008, 12:01:22 PM6/27/08

to

On Oracle 10G, lets say I have a table that looks like this:

create table t1 (key varchar2(10), start_dtm date, end_dtm date);

and I have data some of which over laps:

insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'),
to_date('30/03/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'),
to_date('30/03/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'),
to_date('15/04/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'),
to_date('30/12/2001', 'DD/MM/YYYY'));

Ie
1st Jan - 30th March
2nd Feb - 30th March
15th Feb - 15th April

1st July - 30th Dec

I want to return only two rows from the 4 that look like

1st jan - 15th April
1st July - 30th Dec

ie - if the dates are continuous, squash all the continuous rows into
1 with the min start and max end date
but if there is a gap it has to become a second row. There can be any
number of rows and over laps etc.

I reckon this can be done with some analytics trickery, but I just
cannot seem to figure out how.

Any ideas?

Thanks,

Stephen.

Maxim Demenko

unread,

Jun 27, 2008, 8:03:15 PM6/27/08

to stephen O'D

stephen O'D schrieb:

Maybe, something like this

SQL> select key, min(start_dtm) start_dtm, max(end_dtm) end_dtm
2 from (select key,
3 start_dtm,
4 end_dtm,
5 sum(interval_change) over(partition by key order by
end_dtm, start_dtm) interval_no
6 from (select key,
7 start_dtm,
8 end_dtm,
9 case
10 when (lag(end_dtm)
11 over(partition by key order by end_dtm,
12 start_dtm) - start_dtm) > 0 then
13 0
14 else
15 1
16 end interval_change
17 from t1))
18 group by key, interval_no
19 order by key, interval_no
20 ;

KEY START_DTM END_DTM
---------- -------------------- --------------------
1234 01-Jan-2001 15-Apr-2001
1234 01-Jul-2001 30-Dec-2001

Best regards

Maxim

Charles Hooper

unread,

Jun 27, 2008, 9:50:32 PM6/27/08

to

Thanks for posting the DDL and DML to create the test setup. I see
that Maxim provided a solution for you in the 90 minutes that I
experimented with the problem (I initially read the requirement
wrong). Maxim's solution appears to be more elegant than what follows
(and there may still be a couple issues), but it might be interesting
to look at how a solution might be worked out.

First, we experiment with LAG so that we are able to examine the
previous row when sorted by START_DTM and END_DTM:
SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM,
LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_START_DTM,
LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM;

KEY START_DTM END_DTM M_START_D M_END_DTM
---------- --------- --------- --------- ---------
1234 01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01
1234 01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01
1234 15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01
1234 01-JUL-01 30-DEC-01

If we now slide the above into an inline view and see if the START_DTM
is greater than M_END_DTM (the previous row's END_DTM), we may have
found a new series of dates (note, I just noticed that this may have
problems if there are two rows with discrete date ranges that fall
entirely within a third, larger date range), we will output 1 if true,
or 0 if false:
SELECT
KEY,
START_DTM,
END_DTM,
M_START_DTM,
M_END_DTM,
DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
(SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM,
LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_START_DTM,
LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM);

KEY START_DTM END_DTM M_START_D M_END_DTM C
---------- --------- --------- --------- --------- ----------
1234 01-JAN-01 30-MAR-01 0
1234 01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01 0
1234 15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01 0
1234 01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01 1

In the above, a 1 is output whenever there is a jump in the date range
- if we use the COUNT analytical function to create a running count of
the 1s and slide the above into an inline view, we actually create a
column that may be used for grouping:
SELECT
KEY,
START_DTM START_DTM,
END_DTM END_DTM,
SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
KEY,
START_DTM,
END_DTM,
M_START_DTM,
M_END_DTM,
DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
(SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM,
LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_START_DTM,
LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM));

KEY START_DTM END_DTM G
---------- --------- --------- ----------
1234 01-JAN-01 30-MAR-01 0
1234 01-FEB-01 30-MAR-01 0
1234 15-FEB-01 15-APR-01 0
1234 01-JUL-01 30-DEC-01 1

We are now able to group on the column G by again sliding the above
into an inline view:
SELECT
KEY,
MIN(START_DTM) START_DTM,
MAX(END_DTM) END_DTM
FROM
(SELECT
KEY,
START_DTM START_DTM,
END_DTM END_DTM,
SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
KEY,
START_DTM,
END_DTM,
M_START_DTM,
M_END_DTM,
DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
(SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM,
LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_START_DTM,
LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY
T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
T1
ORDER BY
T1.START_DTM,
T1.END_DTM)))
GROUP BY
KEY,
G;

KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 15-APR-01
1234 01-JUL-01 30-DEC-01

Let's add a little more data to see what happens:
insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'),
to_date('29/12/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'),
to_date('15/01/2002', 'DD/MM/YYYY'));

KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 15-APR-01
1234 01-JUL-01 30-DEC-01
1234 31-DEC-01 15-JAN-02

Note the addition of the last row in the output - you may want to
determine if that should be part of the previous group of date. You
should be able to fix this by adding or subtracting a date in the
SIGN( ) function.

I originally thought that you were trying to eliminate this row from
the output:
KEY START_DTM END_DTM
---------- --------- ---------
1234 01-FEB-01 30-MAR-01

For the above, there are several methods:
SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM
FROM
T1
WHERE
(T1.KEY,
T1.START_DTM,
T1.END_DTM) NOT IN (
SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM
FROM
T1,
T1 T2
WHERE
T1.KEY=T2.KEY
AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM
AND (
T1.START_DTM<>T2.START_DTM
OR T1.END_DTM<>T2.END_DTM));

KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 30-MAR-01
1234 15-FEB-01 15-APR-01
1234 01-JUL-01 30-DEC-01

SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM
FROM
T1,
(SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM
FROM
T1,
T1 T2
WHERE
T1.KEY=T2.KEY
AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM
AND (
T1.START_DTM<>T2.START_DTM
OR T1.END_DTM<>T2.END_DTM)) T3
WHERE
T1.KEY=T3.KEY(+)
AND T1.START_DTM=T3.START_DTM(+)
AND T1.END_DTM=T3.END_DTM(+)
AND T3.KEY IS NULL;

KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 30-MAR-01
1234 15-FEB-01 15-APR-01
1234 01-JUL-01 30-DEC-01

SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM
FROM
T1
MINUS
SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM
FROM
T1,
T1 T2
WHERE
T1.KEY=T2.KEY
AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM
AND (
T1.START_DTM<>T2.START_DTM
OR T1.END_DTM<>T2.END_DTM);

KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 30-MAR-01
1234 15-FEB-01 15-APR-01
1234 01-JUL-01 30-DEC-01

SELECT
KEY,
START_DTM,
END_DTM
FROM
(SELECT
T1.KEY,
T1.START_DTM,
T1.END_DTM,
MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM)
M_START_DTM,
MAX(T1.END_DTM) OVER (PARTITION BY T1.KEY, T1.START_DTM)
M_END_DTM
FROM
T1)
WHERE
START_DTM=M_START_DTM
AND END_DTM=M_END_DTM;

KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 30-MAR-01
1234 15-FEB-01 15-APR-01
1234 01-JUL-01 30-DEC-01

Charles Hooper
IT Manager/Oracle DBA
K&M Machine-Fabricating, Inc.

stephen O'D

unread,

Jul 1, 2008, 11:30:09 AM7/1/08

to

Guys - thanks for the replies on this. I eventually worked it out too
with a third solution after dredging AskTom and the replies here ...
Here is my solution to complete the picture:

Using the same table as above and this data:

create table t1 (key varchar2(10), start_dtm date, end_dtm date);

insert into t1 values(
9999, to_date('01/01/2008', 'DD/MM/YYYY'),
to_date('01/06/2008', 'DD/MM/YYYY'));

insert into t1 values(
9999, to_date('01/02/2008', 'DD/MM/YYYY'),
to_date('01/06/2008', 'DD/MM/YYYY'));

insert into t1 values(
9999, to_date('01/02/2008', 'DD/MM/YYYY'),
to_date('01/05/2008', 'DD/MM/YYYY'));

insert into t1 values(
9999, to_date('01/02/2008', 'DD/MM/YYYY'),
to_date('01/07/2008', 'DD/MM/YYYY'));

This gives a set of rows like:

|------------------------------------------|
|---------------------------------------|
|---------------------|
|------------------------------------------|

And we expect this to turn into a single row like

01/01/08 |----------------------------------------------| 01/07/2008

insert into t1 values(
9999, to_date('01/08/2008', 'DD/MM/YYYY'),
null);

insert into t1 values(
9999, to_date('01/09/2008', 'DD/MM/YYYY'),
to_date('01/10/2008', 'DD/MM/YYYY'));

|--------------------------------------->
|------------|

And we want an row that looks like

01/08/2008 |----------------------------> (null end date)

I think this query does it:

select key, min(start_dtm), max(end_dtm)
from
(
select key, start_dtm, end_dtm, max (grp) over (partition by key
order by start_dtm asc) grp2
from
(
SELECT key,
start_dtm,
NVL (end_dtm, '31-dec-4712') end_dtm,
case
when
(start_dtm not between lag(start_dtm) over (partition
by key order by start_dtm asc)
and nvl( lag(end_dtm) over (partition by key order
by start_dtm asc), '31-dec-4712') )
or lag(start_dtm) over (partition by key order by
start_dtm asc) is null
then
row_number() over (partition by key order by
start_dtm asc)
end grp
FROM t1
order by key, start_dtm asc
) a
) b
group by key, grp2

GRP MIN(START MAX(END_D
--------------------------------- --------- ---------
09999 01-JAN-08 01-JUL-08
09999 01-AUG-08 31-DEC-12

2 rows selected.

Basically I said, if you order the rows by increasing start date, then
if rows overlap, the current start_dtm must be between the previous
start_dtm and end_dtm - if there is no previous start_dtm or they
don't overlap, then its a new group.

The innermost select outputs something like:

KEY START_DTM END_DTM GRP
--------------------------------- --------- --------- ----------
09999 01-JAN-08 01-JUN-08 1
09999 01-FEB-08 01-JUL-08
09999 01-FEB-08 01-JUN-08
09999 01-FEB-08 01-MAY-08
09999 01-AUG-08 31-DEC-12 5
09999 01-SEP-08 01-OCT-08

Then we use analytics again to fill in the blanks in the enclosing
query (b):

KEY START_DTM END_DTM GRP2
--------------------------------- --------- --------- ----------
09999 01-JAN-08 01-JUN-08 1
09999 01-FEB-08 01-JUL-08 1
09999 01-FEB-08 01-JUN-08 1
09999 01-FEB-08 01-MAY-08 1
09999 01-AUG-08 31-DEC-12 5
09999 01-SEP-08 01-OCT-08 5

and then simply group by KEY, GRP in the outer query to get the
result.

I am glad I went through the pain of figuring this out (took me a good
90 minutes) as I really feel like I get analytics now. It also
allowed me to replace a temporary table and several 100 lines of PLSQL
that was performing rather poorly and written by someone who doesn't
get analytics!

Thanks again,

Stephen.

joel garry

unread,

Jul 1, 2008, 1:43:50 PM7/1/08

to

On Jul 1, 8:30 am, "stephen O'D" <stephen.odonn...@gmail.com> wrote:

> I am glad I went through the pain of figuring this out (took me a good
> 90 minutes) as I really feel like I get analytics now. It also
> allowed me to replace a temporary table and several 100 lines of PLSQL
> that was performing rather poorly and written by someone who doesn't
> get analytics!
>
> Thanks again,
>
> Stephen.

I keep feeling like I get it, don't use it enough, then don't get it
any more. But this helped: http://www.orafaq.com/usenet/comp.databases.oracle.server/2007/08/03/0224.htm

jg
--
@home.com is bogus.
http://catless.ncl.ac.uk/Risks/25.21.html#subj8

jeffe...@gmail.com

unread,

Nov 16, 2012, 10:28:13 AM11/16/12

to

The solution by Maxim does not handle collapsing date ranges such as:
11-01-2012 11-09-2012
11-02-2012 11-03-2012
11-04-2012 11-05-2012
11-08-2012 11-15-2012
This should consolidate down to a single date range: 11-01-2012 11-15-2012, but the third and fourth ranges will be detected as an interval change and you end up with 3 ranges. After much googling and experimenting, I finally found a solution posted here http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/b5dd3d09-7299-4c5c-8e9d-5f06b93a9cbd, which I modified to handle null stop dates and tested on large sets of date ranges. Here is the code:

SELECT s1.primary_key,
s1.start_datetime,
MIN(t1.stop_datetime) AS stop_datetime
FROM your_table s1
JOIN your_table t1
ON s1.primary_key = t1.primary_key
AND (s1.start_datetime <= t1.stop_datetime -- s1 starts before t1 ends
OR t1.stop_datetime IS NULL)
AND NOT EXISTS -- filter out all but the t1's with max stop time
(SELECT NULL
FROM your_table t2
WHERE t1.primary_key = t2.primary_key
-- t1 stop time falls within t2 (in which case filter out t1)
AND t1.stop_datetime >= t2.start_datetime
AND (t1.stop_datetime < t2.stop_datetime
OR t2.stop_datetime IS NULL)
)
WHERE NOT EXISTS
(SELECT NULL
FROM your_table s2
WHERE s1.primary_key = s2.primary_key
-- s1 start time falls within s2 (in which case filter out s1)
AND s1.start_datetime > s2.start_datetime
AND (s1.start_datetime <= s2.stop_datetime
OR s2.stop_datetime IS NULL)
)
GROUP BY s1.primary_key, s1.start_datetime
ORDER BY s1.primary_key, s1.start_datetime
;