Join Intervals in dplyr

1,965 views
Skip to first unread message

my.r...@gmail.com

unread,
Mar 8, 2014, 4:52:51 AM3/8/14
to manip...@googlegroups.com
In dplyr, is it possible to do joins based on intervals, e.g. something like this (artificial example for illustration below), where for simplicity `mydate` is the number of days since 1970-01-01 (to avoid problems with date conversions between R and SQL). 

SELECT * 
FROM tab_a a
LEFT JOIN tab_b b
ON a.mydate > b.mydate

I think something like this would be very useful, e.g. for time series or longitudinal data. 

my.r...@gmail.com

unread,
Mar 9, 2014, 6:04:22 AM3/9/14
to manip...@googlegroups.com
Just for clarification, I know that you can do stuff like that using `left_join` for example. But as far as I understand the documentation, you can only join based on equality (sort of like `ON a.mydate = b.mydate`). You can't join based on ineqalities (like `ON a.mydate > b.mydate` below) or date ranges (like `ON a.mydate BETWEEN b.mydate AND b.mydate+8), although this is possible in SQL. This is what I mean by time *interval*. But maybe there is a way to do this in dplyr, so that's why I'm asking. Thanks. 

Hadley Wickham

unread,
Mar 10, 2014, 12:42:30 PM3/10/14
to my.r...@gmail.com, manipulatr
If we add this (it's quite a bit of work), I think the syntax would be
something like:

cross_join(a, b, filter = x.mydate > y.mydate)

In an ideal world, you'd be able to do

cross_join(a, b) %.% filter(a.mydate > b.mydate)

but currently all local data frame commands execute immediately, so
that would materialise the complete cross-join, which is a really bad
idea.

Hadley
> --
> You received this message because you are subscribed to the Google Groups
> "manipulatr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to manipulatr+...@googlegroups.com.
> To post to this group, send email to manip...@googlegroups.com.
> Visit this group at http://groups.google.com/group/manipulatr.
> For more options, visit https://groups.google.com/d/optout.



--
http://had.co.nz/

my.r...@gmail.com

unread,
Mar 10, 2014, 10:24:11 PM3/10/14
to manip...@googlegroups.com, my.r...@gmail.com
Hadley,

Thanks, I think maybe a solution is to add an operator that delays the
execution? Something like this (roughly speaking):

delay(cross_join(a, b)) %.% filter(a.mydate > b.mydate)

Not sure whether this is possible.


Hadley Wickham

unread,
Mar 10, 2014, 10:56:34 PM3/10/14
to my.r...@gmail.com, manipulatr
If such an operator existed, then cross_join would use it
automatically. It might be possible for cross_join to work lazily (and
hence behave differently to other dplyr functions), the challenge is
figuring out how to efficiently do things like summarise and mutate.

Hadley

Michael Smith

unread,
Mar 10, 2014, 11:02:11 PM3/10/14
to Hadley Wickham, manipulatr
Hadlay,

About your original idea of putting it into a single command, I have two
additional points.

1) Maybe your suggestion of

cross_join(a, b, filter = a.mydate > b.mydate)

can first be implemented for the SQL backend only, and later for
data.frame and data.table. This would be relatively easy to do (I hope),
since it already works in SQL, so it would "only" be a matter of
translating it to an SQL command.


2) I was thinking more of an extension of the existing *_joins, since
for example my SQL code in my earlier post is an inner join, not a cross
join. For example, add a new argument, say, `on` and then use it for
example like

inner_join(a, b, on = a.mydate > b.mydate & a.mydate <= b.mydate + 14)

This way people could still use the `by` argument in the usual way, but
for more complicated joins they could use `on`.

Hadley Wickham

unread,
Mar 14, 2014, 8:45:44 AM3/14/14
to Michael Smith, manipulatr
> 2) I was thinking more of an extension of the existing *_joins, since
> for example my SQL code in my earlier post is an inner join, not a cross
> join. For example, add a new argument, say, `on` and then use it for
> example like
>
> inner_join(a, b, on = a.mydate > b.mydate & a.mydate <= b.mydate + 14)
>
> This way people could still use the `by` argument in the usual way, but
> for more complicated joins they could use `on`.

Except that's no longer an inner join, so the function name becomes confusing.

Hadley

--
http://had.co.nz/

Jarosław Nirski

unread,
Apr 7, 2014, 5:27:34 PM4/7/14
to manip...@googlegroups.com, Michael Smith
I guess a rolling join is what would help here. And would be much welcome in dplyr anyway. It's an extremely useful and powerful (and fast) feature of data.table, known also as last observation carried forward (LOCF). But, as Matt Dowle points out, not easily translated into SQL because SQL is inherently unordered.

Jarek
Reply all
Reply to author
Forward
0 new messages