Question about nested JOINs+subqueries, which way is most efficient?

Robert Maas, http://tinyurl.com/uh3t

Nov 22, 2009, 7:41:10 PM

I have an application where users can nominate answers to survey
questions. The master user censors the nominations to keep out
abusive language such as obscenity, and approves all the reasonable
answers. Users can then vote on their favorite already-approved
answers by investing labor credits. Three tables are relevant:
- surveys(id int(8) NOT NULL auto_increment, ...)
;Lists the survey questions themselves
- surans(id int(8) NOT NULL auto_increment,
idsur int(8) ...)
;Lists answers and associates them with questions
- surinvest(id int(8) NOT NULL auto_increment,
userid int(8) ...
idans int(8) ...
ms int(8) ...)
;Associates between users and answers, showing milliseconds of investment
(There's also a table of users, where surinvest.userid is linked,
but not directly relevant to the query I'm building here.)
When a user nominates a new answer, that's indicated by a surinvest
record showing 1 millisecond of investment by that user in that
answer.
When the master user (userid=1) approves a previously-nominated
answer, that's indicated by a surinvest record showing 1
millisecond of investment by the master user in that answer.
Then when the user who nominated an answer sees that it's been
approved, he/she can increase the surinvest record to show more
than 1 millisecond, by an UPDATE to the original nomination
surinvest record.
When some *other* user sees a new answer, he/she can INSERT a new
surinvest record.
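
As a sketch of the conventions above, the statements might look roughly like this. The column list is limited to the columns named in the table summary, and the variables $usernid, $ansid and $newms with their values are placeholders for illustration only, not taken from the actual code.

```php
<?php
// Hypothetical values for illustration only.
$usernid = 7;    // an ordinary user
$ansid   = 42;   // surans.id of the answer concerned
$newms   = 5000; // new total investment in milliseconds

// Nomination: the nominating user invests 1 ms in the new answer.
$quenominate = "INSERT INTO surinvest (userid, idans, ms) " .
               "VALUES (" . intval($usernid) . ", " . intval($ansid) . ", 1)";

// Approval: the master user (userid=1) invests 1 ms in that same answer.
$queapprove = "INSERT INTO surinvest (userid, idans, ms) " .
              "VALUES (1, " . intval($ansid) . ", 1)";

// Raising an investment: UPDATE the user's existing record for that answer.
$queraise = "UPDATE surinvest SET ms = " . intval($newms) . " " .
            "WHERE userid = " . intval($usernid) . " " .
            "AND idans = " . intval($ansid);
```

The intval() calls are an added precaution so that only integers end up in the SQL text.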

When a user nominates an answer to a question, it may be a
brand-new answer (each user may have at most one not-yet-approved
answer to *each* question at the same time), or an edit/amendment
to a previously-nominated answer. I want to do INSERT if it's new
but UPDATE if it's an amendment. Currently I have code to *always*
do INSERT, as a temporary measure to get the basic service
up&running, but now that it's running I want to do UPDATE when
appropriate, hence this new query I'm asking about.

I have three primitive queries (PHP/MySQL):
$queansque = "SELECT id,idsur,shortitl FROM surans WHERE idsur = $surid";
/* i.e. all answers to this particular survey question, which is
expected to forever remain small, at most ten or maybe twenty
different answers for one survey question. */
$queinvusr = "SELECT id,idans FROM surinvest WHERE userid = $usernid";
/* i.e. all answers nominated and/or voted by this one user for
*all* questions, which might get semi-large, in the hundreds
or thousands. */
$queinvmas = "SELECT id,idans FROM surinvest WHERE userid = 1";
/* i.e. all approved answers to *all* questions, which is expected
to get quite large, in the millions or billions eventually. */
which will all be included in the main query as nested queries
and/or direct JOINs, depending on how I decide to organize the
complicated query that relates all three tables. The query I need is:
all currently-pending nominations by this one user to this one question,
which is computed as
*all* answers to *this* one question which have an investment by
this one user, such that the master user has *not* invested in
this same answer
which in SQL is expressed as a JOIN between $queansque and
$queinvusr to get all answers to *this* question nominated/voted by
*this* user, and that LEFT JOINed to $queinvmas to see whether each
such has been approved or not, and WHERE queinvmas.id IS NULL to
eliminate all that are already approved, leaving only the pending
nominations. But I could also do the LEFT JOIN between $queansque
and $queinvmas first, to get all approvals of answers in this one
table, then JOIN that with the $queinvusr, and finally the WHERE
clause. Either way uses each sub-query expression just once, hence
is syntactically efficient, but each involves a rather large
sub-query, which only *later* gets reduced by means of a JOIN,
which might be operationally expensive. A third way, duplicating
one of the sub-query expressions in order to *immediately* reduce
the size of sorta-temporary tables is to do both the JOIN and LEFT
JOIN WHERE at the start, i.e.:
$queuserqueans = ... $queansque JOIN $queinvusr ...
$quependqueans = ... $queansque LEFT JOIN $queinvmas ... WHERE ...
(note that the SQL string $queansque appears twice)
(And of course I wouldn't write either of those explicitly as JOIN
or LEFT JOIN with nested SELECT on each side, rather I'd hand-edit
each to be a single flat JOIN or LEFT JOIN query, something like:
$queuserqueans = "SELECT surans.id " .
"FROM surans JOIN surinvest " .
"ON surans.id = surinvest.idans " .
"WHERE surinvest.userid = $thisuser " .
"AND surans.idsur = $thisquestion";
$quependqueans = (too complicated for me to work out right now)
)
Then JOIN those together (by nesting the above on opposite sides of
the JOIN keyword).
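
For comparison, the first ordering described earlier (JOIN to this user's investments, then LEFT JOIN to the master user's investments, then keep only rows with no master-user match) can also be flattened into a single query with no nested SELECT. This is only a sketch: the aliases u and m for the two uses of surinvest, the output column names, and the sample values are assumptions made here.

```php
<?php
// Hypothetical values for illustration only.
$thisuser     = 7;
$thisquestion = 3;

// u = this user's investment rows, m = the master user's (userid=1) rows.
// The m.userid = 1 test must sit in the ON clause: in the WHERE clause it
// would contradict m.id IS NULL and the query would never return a row.
$quepending = "SELECT surans.id AS idans, u.id AS idinvest " .
              "FROM surans " .
              "JOIN surinvest AS u " .
              "ON u.idans = surans.id AND u.userid = " . intval($thisuser) . " " .
              "LEFT JOIN surinvest AS m " .
              "ON m.idans = surans.id AND m.userid = 1 " .
              "WHERE surans.idsur = " . intval($thisquestion) . " " .
              "AND m.id IS NULL";
```

Which join order MySQL actually executes is decided by its optimizer, so prefixing the query with EXPLAIN is one way to check the plan it picks for a given form.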

In any case, the final result will have either one row or no rows
at all (two or more rows indicates a bug somewhere else), which I
can then use to decide whether to UPDATE that one row or INSERT a
new row.
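
The row-count rule just described could be captured in a small helper along these lines. The function name decide_action and the shape of its argument are hypothetical; it only assumes the result set has already been fetched into a PHP array.

```php
<?php
// Hypothetical helper, not from the original code: $pendingRows is the
// result set of the pending-nominations query, fetched into a PHP array.
function decide_action(array $pendingRows)
{
    $n = count($pendingRows);
    if ($n === 0) {
        return 'INSERT'; // no pending nomination yet: brand-new answer
    }
    if ($n === 1) {
        return 'UPDATE'; // exactly one pending nomination: amend it
    }
    // Two or more rows indicates a bug somewhere else, so fail loudly.
    throw new LogicException("expected 0 or 1 pending nominations, got $n");
}
```

Throwing on two or more rows makes the "bug somewhere else" case visible instead of silently updating an arbitrary row.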

So what's more important for efficiency, to avoid duplicating
sub-expressions, or to reduce the size of virtual tables created by
sub-queries as quickly as possible even if that means duplicated
sub-expressions? Which of those four methods scales best?
Any *other* even-better ways I didn't even think of?

Currently http://TinyURL.Com/Portl1 has only four users, one survey
question, and four or five answers, so efficiency isn't yet an
issue. I could write crappy code now and worry about efficiency
later when I have more than a thousand users and more than a
hundred survey questions, but I think I'm in the mood to do it
right the first time now so that I'll never have to re-write this
code for better efficiency.

No, I'd rather not have to do it all four ways on a test site,
artificially populate the tables with fifty million fake users
voting randomly for ten thousand survey questions and master user
randomly approving 99% of answers leaving 1% not yet approved, and
then measure the actual time each way takes. I'd rather an expert
tell me which factor is usually most crucial for efficiency, hence
which *one* way to write the code now, or even a totally different
way to get the same result that's more efficient than any of my
ideas.
