A query too far? Inner join with group by...

152 views
Skip to first unread message

Joe Barnhart

unread,
Feb 17, 2014, 6:38:26 AM2/17/14
to web...@googlegroups.com
So I'm trying to use the DAL to create all of my queries, but I may have just exceeded its capability.

My task is to create a view of "best times" for a particular sport.  The sport has multiple events and the user is interested in collecting their best times over the entire database, organized by event.  The SQL looks something like this:

select * from times as t
inner join
(
 
select course, event_code, min(finals) as "min_time" from times
   
where id_competitor=XXXX and finals>0
   
group by course, event_code
 
) as tt
on t
.event_code=tt.event_code
 
and t.course=tt.course
 
and t.finals=tt.min_time
 
and t.id_competitor=XXXX
order
by t.course, t.event_code

The XXXX represents a specific literal ID number for the competitor.  This query is to be run for each competitor when they choose to see the view, so limiting it to a single person is the desired approach.  

The inner join choose the best (minimum) time for each course and event_code.  

Disqualifications are stored as negative times, so the clause to restrict finals>0 is to exclude any of these outliers.

I've been studying up on the alternate join syntax for the DAL but I don't see a way to handle the conditions such as "finals>0" in the new syntax.  If the "on" method could be attached to a Set object instead of only a Table, maybe I could approach this query.

But I thought I'd ask the real SQL experts before I give up.  Also, what is my fallback if I need to do this query outside of DAL?

-- Joe B.

Richard Vézina

unread,
Feb 17, 2014, 10:32:57 AM2/17/14
to web2py-users
Hello Joe,

Found your plain sql query already convoluted, there is not simpler way to express it in SQL first? You seem to use join as if it was a where, join not intent for this, using it like that will make thing slower...

Could this work :

subquery = db.executesql('''select course, event_code, min(finals) as "min_time" from times
    
where id_competitor=XXXX and finals>0

    
group by course, event_code''')

rows = db((db.times.event_code.belongs(subquery[1])) & (db.times.course.belongs(subquery[0]) & (db.times.min_time.belongs(subquery[2])) & (db.times.competitor=='XXXX').select(db.times.ALL)

I can't try 

:(

Richard


--
Resources:
- http://web2py.com
- http://web2py.com/book (Documentation)
- http://github.com/web2py/web2py (Source code)
- https://code.google.com/p/web2py/issues/list (Report Issues)
---
You received this message because you are subscribed to the Google Groups "web2py-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to web2py+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Joe Barnhart

unread,
Feb 17, 2014, 3:24:26 PM2/17/14
to web...@googlegroups.com
Hi Richard --

I thought about a subquery...  But I've never seen or heard of the trick of using "subquery[0]" to pick out one column of the subquery.  

I see your approach is to materialize the subquery as a Rows object and then pick out elements.  I don't think this will work as expected tho because the "belongs" logic will choose incorrect values.  For example say the time is 10 seconds, but there are 20 different rows (NOT "best" times) with the same time.  Your approach will choose the 10 sec. time for ALL combinations of event code and course -- not just the SPECIFIC combination in the row that matches the "best" time..  

That's why I chose a "join" -- it forces the elements to match perfectly.  But thanks for the technique.  The idea of materializing the Rows and using it in another query bears thinking about.

-- Joe

Richard Vézina

unread,
Feb 17, 2014, 3:51:07 PM2/17/14
to web2py-users
But what you do in the join should be reproducible in a where clause... I start rewriting your join into a where clause at first and realize that I would need to rewrite 3-4 for time the same subquery in sql or I had to write a recursive query that would not help in translation into web2py query...

I don't see how the multiples belongs would not lead to the samething as you query did you try it?

You still have the option of just using db.executesql(''' your actual sql query ''')

Richard

Joe Barnhart

unread,
Feb 17, 2014, 4:46:23 PM2/17/14
to web...@googlegroups.com
Hi Richard --

Don't need to try it...  Do a thought experiment...  The "belongs" matches the entire set of all affected times, event_codes and courses.  It becomes the cartesian product of all combinations of time, event_code, and course.  It matches N*M*K entries in the times table, where N, M, and K are the counts of the members of the set.  By contrast, the number of rows in the minimum times query is a tiny fraction of that.

Let's say I have a competitor with a minimum time of 10 sec. in event 1, course A.  He may also have that exact time (on a bad day) for event 2, course B.  In this scenario, event 2 course B would normally be a much faster time -- in fact his "best" times has such a faster time.  But when you do the overall query with "belongs" on ALL members of times, event_code, and course, you will match BOTH 10 sec. times -- event 1 on course A and event 2 on course B, so both times will appear in the final result.  (Along with the true best time for event 2, course B, which was faster than 10 sec.)

I could probably do a correlated subquery to solve this problem, but those are famously inefficient.  The join, convoluted as it is (and I agree there) is still the best way.

What I really wish I could do is use a select to return the min(time) and the ID of the row it occurred on.  But GROUP BY won't allow that.  One of the limitations of the set theory underpinning SQL.  That's where the correlated subquery comes it -- do a select with ORDER BY on time, then choose LIMIT BY 0,1 to get only one row.  But that query has to be repeated for every combination of course and event_code.  I have about 20 rows to search for a typical competitor and each query above takes around 700 ms. so that's not a good option!

-- Joe B.

Richard Vézina

unread,
Feb 18, 2014, 9:59:45 AM2/18/14
to web2py-users
So... You seem to be ready for a stored procedure... Or you need to thought to your problem into more then one step, maybe having a view for the sub query...

I was to porpose the multiple subquery, since your query is already slow, I would try this option, just to see if it couldn't be a bit faster...

For slow query that for which I don't want to invest more time to create a function or stored procedure, I generally create a view and cache it with web2py...

Richard

Derek

unread,
Feb 18, 2014, 2:53:19 PM2/18/14
to web...@googlegroups.com
Joe, it seems to me all you need is this part here:
select course, event_code, min(finals) as "min_time" from times
    
where id_competitor=XXXX and finals>0
    
group by course, event_code

Or are you interested in returning other columns within the 'times' table?
if so, just use min(other columns) in the query, no need for a derived table.

So, in any case, web2py works a little differently. Taking the book example and modifying it, here's what I get...

for row in db(db.times.competitor==xxxx).select(
        db.times.course, db.times.event_code, min_time, groupby=(db.times.course,db.times.event_code)):
        print row.times.course, row[min_time]
What you need to do before that can work is to create the min_time operator...

min_time = db.times.finals.min()

oh, but you wanted to only select non negative numbers, and finals has some negatives in it...  (add a where clause to the select)
Good luck!

Joe Barnhart

unread,
Feb 19, 2014, 6:29:55 AM2/19/14
to web...@googlegroups.com
Hi Derek --

Thanks for your response.  I had considered your approach, but just declaring min() for every column does not return the values I need.  Take for example the min time and the minimum "meet name".  Yes, I could get the lexical minimum of a character string that represents the meet at which the best time was recorded -- only it wouldn't be correct.  The min(time) and min(meet_name) would have no relationship to one another -- they would be from completely different rows.

That's why I must find the minimum time in one query, then use it to probe the entire table to find the full row it came from and read the other values.  I've been pouring over the web looking for an alternative, but I just can't find one for my situation.

Fortunately, I've been able to create a string and use "executesql" to get the query I need.  It took some doing, but I managed to get the query into a single "inner join" and with proper indexing it is amazingly fast.  I clocked it at about 20ms for a typical query on my data set.  I was still able to make the result a proper Rows object and use SQLTABLE to style the output.

Warm regards,

Joe B. 

Joe Barnhart

unread,
Feb 19, 2014, 7:06:30 AM2/19/14
to

Hi Richard --

Well, it wasn't the solution I was originally thinking of, but you clued me into the solution that works.  I used "executesql" and a big 'ol triple-quoted SQL string to get the query working.  And the surprise (for me) was how easy it was to get the output into a Rows object where I could use the power of web2py to format the output for my view.

The DAL is amazing and very broad in what it covers, but there are just certain corner-cases of SQL which will always elude it.  I've decided that's a good thing.  Software systems that try to do *everything* in a new paradigm tend to become so bloated with corner-cases that they lose their way.  Our little web2py has such sparse beauty it would be a shame to see it try to turn into "SQL part deux".

Thanks for your help and for spending some of your thought-photons on my problem.

-- Joe B.

Richard Vézina

unread,
Feb 19, 2014, 9:26:57 AM2/19/14
to web2py-users
:)

Richard


On Wed, Feb 19, 2014 at 6:35 AM, Joe Barnhart <joe.ba...@gmail.com> wrote:

Hi Richard --

Well, it wasn't the solution I was originally thinking of, but you clued me into the solution that works.  I used "executesql" and a big 'ol triple-quoted SQL string to get the query working.  And the surprise (for me) was how easy it was to get the output into a Rows object where I could use the power of web2py to format the output for my view.

The DAL is amazing and very broad in what it covers, but there are just certain corner-cases of SQL which will always elude it.  I've decided that's a good thing.  Software systems that try to do *everything* in a new paradigm tend to become so bloated with corner-cases that they lose their way.  Our little web2py has such sparse beauty it would be a shame to see it try to turn into "SQL part deux".

Thanks for your help and for spending some of your thought-photons on my problem.

-- Joe B.
On Tuesday, February 18, 2014 6:59:45 AM UTC-8, Richard wrote:
So... You seem to be ready for a stored procedure... Or you need to thought to your problem into more then one step, maybe having a view for the sub query...

I was to porpose the multiple subquery, since your query is already slow, I would try this option, just to see if it couldn't be a bit faster...

For slow query that for which I don't want to invest more time to create a function or stored procedure, I generally create a view and cache it with web2py...

Richard



Derek

unread,
Feb 19, 2014, 12:53:28 PM2/19/14
to web...@googlegroups.com
I don't get what you are talking about. you are grouping by the event_code and the course, then the min should give you the minimum for that particular event code and course. You want more columns then group by them as well.

Joe Barnhart

unread,
Feb 19, 2014, 3:00:27 PM2/19/14
to web...@googlegroups.com

Hi Derek --

Yes, I need more columns.  I need the name of the meet where the best time was achieved.  And the club the athlete was a member of, and the date of the fastest time, etc.

I can't put all of those columns in the group and ask for "min" of each because the min of each column will be from different rows of the table.  I want the minimum time, AND the name of that exact meet -- not the minimum time and then the "min" across all meet names.  That's why I need to find the min time in one query and then use that to look up the row where it occurs, so I can get the rest of the data.

See my finished page in my response to Richard and maybe it will make more sense.  It's hard to put database stuff into words!


Warm regards,

Joe B.
Reply all
Reply to author
Forward
0 new messages