Query for the first or last item in a group

2,515 views
Skip to first unread message

JKWA

unread,
Mar 30, 2012, 4:05:52 PM3/30/12
to bigquery...@googlegroups.com
We consume a table with about million records that looks something akin to.

Dave, 20090213, Music, ProgramA
John, 20101001, Music, ProgramB
John, 20100101, Math , ProgramC
Dave, 20101112, Music, ProgramB
John, 20110901, Music, ProgramA

We need to extract tables that can answer questions like, "For each person and subject, which program is associated with the earliest date entry?" (like the table below).

Dave, Music, ProgramA
John, Math , ProgramC
John, Music, ProgramB

In the past I've always just pulled the table into SPSS or SAS to make this kind of query (aggregate and return first row based on the sorted values of one or more columns).

I understand this is possible in BigQuery using WITHIN.  But even if we were allowed us to load this type of schema, it is not the type of table we're consuming.

Am I overlooking something?  Any suggestions or solutions?


Joe

Jordan Tigani

unread,
Apr 2, 2012, 11:44:22 AM4/2/12
to bigquery...@googlegroups.com
I don't think there is a good way to do the query that you want right now with BigQuery. The underlying query technology that BigQuery uses _does_ support queries like this (you could use NEST to group the records in a repeated field and then WITHIN to find the earliest date entry). However, exposing this support would be somewhat non-trivial, since there are non-intuitive limitations to how this query would run that are tied to the layout of the data and the design of the query engine. 

So in short, I don't think we'll have a solution for you in the short term. Longer term, there are a couple of things we plan to add that could solve your problem, but I don't want to speculate on when they'll be available.

-jordan 

Michael Sheldon

unread,
Apr 2, 2012, 12:04:43 PM4/2/12
to bigquery...@googlegroups.com
Siddartha suggested that a self-join might work, and this would be the "classic" SQL solution to this problem. Ah, I just found a reference: http://community.gibraltarsoftware.com/forums/yaf_postst3904_RANK--Using-a-SELF-JOIN-or-other.aspx

Looking at other databases, there are custom functions like "rank" that you could sub-select with and then choose the rows where rank == 1. A statistically accurate and/or limited-window rank function would be helpful for making this efficient.

--Michael Sheldon

JKWA

unread,
Apr 2, 2012, 12:48:57 PM4/2/12
to bigquery...@googlegroups.com

Thanks guys, this is really helpful.


Sanjath Shringeri

unread,
Apr 2, 2012, 5:25:28 PM4/2/12
to bigquery...@googlegroups.com
Doing the join like in the RANK example may not work as joins in BQ are restrictive, BQ doesn't like the OR clause in ON condition. Not sure if I am missing something here.

I think first and last aggregation is probably challenging on a columnar datastore as compared to row based datastores. 

What we have done for a similar query is, we are using self join but using MIN(), MAX() aggregate function to identify the row and join using that value from min() and max() to get the first and last row.  Since the column that we used in min() and max() was also the primary key for the row, it made it easy. 

 - Sanjath
--
___________________________________
Sanjath Shringeri | VP, Engineering | claritics | Social. Intelligence. Now. 
Come visit us at GDC 2012 in San Francisco, booth & session
408.796.1287 |

Reply all
Reply to author
Forward
0 new messages