Prior to reading the book, I've always had understood that anything we
did in realm of relational algebra (or calculus) would be set-oriented
and to lesser extent sames applies to the SQL implementation, at least
in the theory (as not all vendors necessarily are consistent in the
implementation). Given that any kind of update operation is logically
all at once, I was quite concerned with the manners of updating itself.
Given two relvars, r and s, containing some numbers of attributes,
sharing a common attribute (or in more general parlance, s relvar
contains a foreign key to relvar r's candidate key), and we wish to
evalulate an expression upon the relvar r and s. Assume the evaluate
will include restrict, project and join, though we need not restrict to
only those three operators.
Relational algebra, like ordinary algebra, can be employed in help us
re-formulate the expression into a even simpler expression, with the net
result that we can reduce the numbers of tuples to be evaluated as we
process the expression. This can be done by examining the operators'
properties such as joins having the commutative property and thus
choosing the smaller relvar and evaluate its tuples against the bigger
relvar's tuples. So far, so good. We can see how much relational
algebra/calculus can help us optimize any kind of queries by
transforming the expression.
But... I don't see any means within the relational algebra that provides
a way of evaluating multiple tuples in one go. Considering that a
relation is essentially a collection of propositions conforming to a
given predicate, it seems necessary to evaluate each tuples to determine
whether they should participate in the join or not, satisfy the restrict
condition and other things. For a lack of better terms, there is no
"sieve" we can employ to evaluate a set of tuples in one go.
Is the preceding paragraph accurate?
TIA.
This is not really the case. All these transformations are ad-hock. If
you try proving, say, "pushing selection through projection" you would
get bogged in set notation with awkward attribute indexing, and even
if succeed actually proving it, it would hardly result in any
insight.
> But... I don't see any means within the relational algebra that provides
> a way of evaluating multiple tuples in one go. Considering that a
> relation is essentially a collection of propositions conforming to a
> given predicate, it seems necessary to evaluate each tuples to determine
> whether they should participate in the join or not, satisfy the restrict
> condition and other things. For a lack of better terms, there is no
> "sieve" we can employ to evaluate a set of tuples in one go.
>
> Is the preceding paragraph accurate?
It is too vague. Algebras, in general, don't care about the structure
of their elements; relational algebra, in particular, doesn't refer to
tuples. (BTW the fact that it refers to attributes is a flaw -- c.d.t
regulars might have already guessed I'm about to sing the familiar
relational lattice tune:-)
What you appear to be looking for is a physical structure not a logical
one, which in SQL one would build with the "CREATE INDEX" statement.
I suppose it's true that we can transform an expression into anything
else and could conceivably add as many zeros for additive/subtractive
operations or ones for multiplication/division operations and still have
logically equivalent expressions, but such expansion would hardly be
interesting in absence of further transformation.
Even so, wouldn't it be fair to say that in order to optimize any given
query upon relation, we would want to find the most simple (or maybe
it's accurate to say 'concise') expression that is computationally
simple to execute compared to other but equivalent expressions?
> It is too vague. Algebras, in general, don't care about the structure
> of their elements; relational algebra, in particular, doesn't refer to
> tuples.
Then I obviously don't have a full understanding of the theory as I
would like to think.
However, could I ask you to explain the statement that tuples aren't
referred at all in relational algebra?
My objective is to understand whether operators exists that enables us
to perform evaluations on a set of propositions as whole rather than
having to verify each proposition.
Back to the r and s example, if I were to join them based on an
attribute x, I would have to look at all possible values in attribute x
of the relvar r then go to s and look up each tuples in relvar s and
verify whether its attribute x has the same value and thus 'select' the
tuple into the resultant output. I've just described an iterative manner
of evaluating what tuples may participate in a join and that's precisely
concerns me; was that actually necessary or does there exist an operator
to process them in bulk?
Or maybe to attempt an analogy, when we multiply two numbers, we need
not add each number to itself as many times as specified by other
operand. It's valid but not the only way to determine the product; we
certainly can determine product by recalling the product from
multiplication table and even for an arbitrarily large operands, use the
same algorithm to quickly arrive at a product.. Does such concept exist
for operating upon relations in like manner?
I do apologize in advance if I'm being vague or imprecise; as I said I'm
quite new and trying to grasp the pictures, though I must admit that I'm
also looking at the whole thing through the lens of practicality; how
does it help us and how much can it help us? It obviously helps on
providing logical design, but I'm looking at the optimizations that it
may have to offer to whatever implementation.
Again, thank you.
Well, I do apologize if I provided the impression that I was looking for
the physical structure or implementation. In fact, what I'm trying to
visualize the practicality relational theory has for any given
implementation, and I can obviously see how it applies in logical design
(or at least I hope I did understand it correctly), however, the same
book I mentioned in my OP did mention (too briefly, alas) how relational
theory can help optimize queries, which I tried to illustrate in my OP
citing the joins' commutative property to transform an expression into a
equivalent expression that may be computationally simpler to resolve.
With that said, I am trying to understand whether it provides us with a
mechanism of evaluating relations as a set without requiring iterations
as I attempted to explain in my other post to Tegiri.
I hope this helps clarifies my quest toward a more solid understanding
of the theory.
It might be clearer to say that while the algebra's ops don't refer to
tuples, the sets involved certainly do contain tuples and as far as I
know, there are no relational ops that can be defined without reference
to tuples.
The theory allows the query to be reformulated without changing its meaning,
but it cannot help estimate the cost of executing one or the other. A simpler
expression may in fact cost more to evaluate.
Primitive RDBMS used heuristics to apply transformations to expressions in
the hope of reducing execution cost. In the mid 80's, most RDBMS ditched
these heuristic optimisers and built cost-based ones. These estimate the
execution cost by using statistics (row counts, index selectivity, ordering
and clustering of indices, etc), though some, like MySQL, still use
heuristics (these are still classed as primitive).
The change was triggered in part at least by the work of Ken J McDonell
on the MUSBUS benchmarking tools, who was able to point out the glaring
differences in actual behaviour of the RDBMS he benchmarked.
--
Clifford Heath, Data Constellation, http://dataconstellation.com
Agile Information Management and Design
Fascinating. It sounds like C.J. Date may have been too optimistic in
asserting that relational theory is sufficient in optimizing the
evaluation of a given expression. (well, he didn't say it can do so
exclusively, but my puzzlement definitely dealt with the 'if relational
theory is supposed to help with this, then why has SQL strayed so far
from the relational theory as understood in mathematics.')
On an aside, though, that didn't quite answer my intended questions of
whether theory provides a means of evaluating in bulk without requiring
one to evaluate tuples individually, as I attempted to explain/ask in my
other posts. But I suspect even if the answer was a yes, the theory has
little to say about how one can optimize such queries.
> The change was triggered in part at least by the work of Ken J McDonell
> on the MUSBUS benchmarking tools, who was able to point out the glaring
> differences in actual behaviour of the RDBMS he benchmarked.
Thank you. I may look his works up and learn a bit about that. It does
provide a hint (?) toward why the actual implementations is quite
divorced from the relational theory.
I'm not following your "add zeroes" idea, but if you are saying that
only those transformations that lead to "shorter" relational
expressions make sense from optimization perspective, than this is not
true. E.g. "magic" transformation.
> Even so, wouldn't it be fair to say that in order to optimize any given
> query upon relation, we would want to find the most simple (or maybe
> it's accurate to say 'concise') expression that is computationally
> simple to execute compared to other but equivalent expressions?
Are you familiar with cost based optimization? They basically throw in
multiple equivalent expression and estimate cost of running each
apriori.
> However, could I ask you to explain the statement that tuples aren't
> referred at all in relational algebra?
Simple. Take projection for example. It contains the pi symbol
together with attributes names. Ditto selection and the others.
> Back to the r and s example, if I were to join them based on an
> attribute x, I would have to look at all possible values in attribute x
> of the relvar r then go to s and look up each tuples in relvar s and
> verify whether its attribute x has the same value and thus 'select' the
> tuple into the resultant output. I've just described an iterative manner
> of evaluating what tuples may participate in a join and that's precisely
> concerns me; was that actually necessary or does there exist an operator
> to process them in bulk?
That is called nested loops join. It is implementation; algebraically
(natural) join is idempotent, symmetric, and associative operation.
> Or maybe to attempt an analogy, when we multiply two numbers, we need
> not add each number to itself as many times as specified by other
> operand. It's valid but not the only way to determine the product; we
> certainly can determine product by recalling the product from
> multiplication table and even for an arbitrarily large operands, use the
> same algorithm to quickly arrive at a product.. Does such concept exist
> for operating upon relations in like manner?
In this respect Bob's answer was spot on: while nested loops is quite
inefficient operator, the indexed nested loops isn't.
But this is like saying that one can't define summation and
multiplication in a ring with abstract elements. You wouldn't insist
that any advanced algebra book had a long arithmetic introduction with
tedious instructions how one adds and multiplies numbers (in their
decimal representation)?
When it comes to relational algebra, I doubt if there are any advanced
books!
Yes, that's what I was trying to understand and your statement
correlates with Clifford's statement that no 'magic' transformation
actually exists.
> Are you familiar with cost based optimization? They basically throw in
> multiple equivalent expression and estimate cost of running each
> apriori.
Yes, but I suspect I had a mistaken notion of what it actually did; I
thought they basically selected an expression that was known to be
cheaper than other equivalent expression but as there is no magic
transformation, it's probably basically a guessimate (though I'm quite
sure it's more complicated/sophisticated than that).
>> However, could I ask you to explain the statement that tuples aren't
>> referred at all in relational algebra?
>
> Simple. Take projection for example. It contains the pi symbol
> together with attributes names. Ditto selection and the others.
>
Just to validate that I've understood your point which I'm not sure I
did. Projection need not refer to the tuples because we can just present
an expression of attribute and the pi symbol. (But wouldn't one still
have to look at the original value to arrive at the resulting value + pi?)
> That is called nested loops join. It is implementation; algebraically
> (natural) join is idempotent, symmetric, and associative operation.
Thank you. I've actually heard of the term (both without an index and
with an index), but mistakenly assumed that it was a DBMS-specific thing
as I don't recall hearing it outside that context.
> In this respect Bob's answer was spot on: while nested loops is quite
> inefficient operator, the indexed nested loops isn't.
Agreed, that's basically what I've had understood anyway. My apologies
for not having made it clear in my earlier description of the relvars.
Even so with benefit of an index, one would still have to iterate over
each tuple to evaluate the expression. In other words, the theory does
not provide a mean of resolving a set of tuples in a single operation
without resorting to the nested loops, and thus various vendors has had
to invent different ways not directly specified or implied by the theory
to help optimize on the process (e.g. clustering to enable a range seek
upon the index for instance). Is that accurate?
And so you know, I do appreciate you taking your time to explain those
things to the uninitiated.
> Tegiri Nenashi wrote:
>
>> if you are saying that only those transformations that lead to
>> "shorter" relational
>> expressions make sense from optimization perspective, than this is not
>> true. E.g. "magic" transformation.
>
> Yes, that's what I was trying to understand and your statement
> correlates with Clifford's statement that no 'magic' transformation
> actually exists.
>
>> Are you familiar with cost based optimization? They basically throw in
>> multiple equivalent expression and estimate cost of running each
>> apriori.
>
> Yes, but I suspect I had a mistaken notion of what it actually did; I
> thought they basically selected an expression that was known to be
> cheaper than other equivalent expression but as there is no magic
> transformation, it's probably basically a guessimate (though I'm quite
> sure it's more complicated/sophisticated than that).
I think you are missing the forest for the trees. It's really quite
simple: The first stage of the optimizer spits out a series of
equivalent relational expressions by rearranging the original expression
according to identities in the algebra. e.g. properties like
transitivity, commutativity etc. The second stage estimates a cost for
each of the expressions until it thinks it has one that is "cheap enough".
Having a simple algebra with plenty of useful identities makes creating
an optimizer much easier. SQL dbmses do something similar, but things
like duplicates and null complicate the algebra and reduce the number of
useful identities. As a result, SQL optimizers are more complicated and
error-prone than they might otherwise be.
One example where the expression is longer but the execution is usually
faster happens when pushing restrict through join. e.g. suppose one
restricts a result to customers in a particular state 'AK' and one joins
the customer relation to a state relation. Expanding the expression to
evaluate state='AK' for both customers and states will reduce the cost
of the join by eliminating all the other states from consideration.
Assuming one has an index on the state relation the join will be very fast.
This is no different from saying that in order to add numbers the
addition operation has to look inside the number structure (that is
decimal representation).
Perhaps you want the EXTEND operator:
EXTEND R ADD A+B AS C
This results in a relation like R but with a new attribute C;
each result tuple's C value is its A value plus its B value.
More generally, if you think of an operator like + as corresponding
to a relation PLUS with numeric attributes LEFT, RIGHT and RESULT
(consisting of all possible tuples where LEFT + RIGHT = RESULT)
then the above is equivalent to
R JOIN (PLUS RENAME LEFT AS A, RIGHT AS B, RESULT AS C)
Of course optimizers use algebraic identities to rewrite.
They just also use other information like size and indices
to create a tree with some optimized property.
The biggest time expense tends to be
secondary storage page access.
It might seem such things have to be implemented as loops,
but it all depends on your implementing machine.
Maybe it is a parallel processor.
If you want the longest stick of spaghetti you just hold
them all upright on the counter.
If relational algebra were presented properly you would find that
first of all it is an algebra schema,
since every set of types produces a different algebra;
that those algebras are "many-sorted", the sorts being relations,
attributes, sometimes attribute lists, sometimes quoted-expressions,
and the types;
and that you would have additional operators like
ONE_BY_ONE_RELATION_type(ATTRIBUTE, value) returning a relation
and VALUE_type(relation) returning a value of the type.
(You don't need to have explicit tuples.)
philip
Indeed. I would challenge the term "relational algebra". Consider how
relational division is introduced; a typical textbook goes like this:
"Consider the two relations with disjoint headers ... Their cartesian
product is ... Now imagine we want an operation that is inverse to
cartesian product..."
The discussion often revolves around concrete examples (PilotsSkill/
Hangars, Certifications/Requirements, etc). This is essentially the
level of elementary second grade math textbook depicting how to
arrange items into rows and columns, and then how to calculate the
number of rows when one knows the total number of things. It is
arithmentics, not algebra.
An algebraic question maybe if division is residuated operation (it is
not); then it is mysterious why relational division fits the role of
inverse to join. Perhaps, this fact hints the existence of many
division-like operations?
http://vadimtropashko.wordpress.com/relational-programming-with-qbql/aggregation-and-set-joins/
I believe you've had a good answer in the existence of NULLs and duplicates,
but I don't know that SQL is *that* far. The relational algebra doesn't
cover the most common four ways to implement an inner join; implementation
isn't its concern. But choice of looping, merge joins, hash joins, etc,
definitely affects query performance.
Also you have to appreciate that SQL is essentially a verbal form of
relational calculus. That gets translated to algebra, and progressively
refined towards physical operations in the formulation of a query plan.
As a calculus, it isn't going to look algebraic, as, for example, QUEL did.
>> The change was triggered in part at least by the work of Ken J McDonell
>> on the MUSBUS benchmarking tools, who was able to point out the glaring
>> differences in actual behaviour of the RDBMS he benchmarked.
> Thank you. I may look his works up and learn a bit about that. It does
> provide a hint (?) toward why the actual implementations is quite
> divorced from the relational theory.
I wish more of his work was still available - I heard him speak about
it - but he's still around, see <http://www.kenj.com.au/>. He doesn't
live that far from me, and I've thought about looking him up. Perhaps
he has some archived material he'd email you?
After MUSBUS, he continued the same kind of work at Pyramid Computers,
with their fantastic multi-processor MIPS machines. They had a policy
of being RDBMS agnostic, but liaising with the vendors closely on
performance analysis so their hardware ran every RDBMS better than
competitive h/w. I wish they were still around.
Thought about that and realized that is basically true of any system;
the act of evaluation itself requires that we, well, actually discover
the independent values to determine the dependent values (apologies if
this is not correct terms to use here).
Going back to the join problem, I think my confusion may have arose in
conflating the evaluation and the expression. We can write an expression
that proposes to join an arbitrary number of relvars but we still must
evaluate the final relvar baed on what tuples satisfies the given
conditions. Thus there is really no way around doing joins without some
kind of evaluation and evaluation is necessarily iterative. Not to say
that we can't make use of index and other techniques to save ourselves
the trouble of needlessly evaluating tuples that wouldn't have made into
the join anyway, but when left with only tuples that may or may not be
an eventual participator in the join each must be evaluated.
In former sense, we do have a 'sieve' in forms of indices, cost-based
optimization/transform, restricting the evaluation to only possibly
relevant tuples but the 'sieve' wasn't provided by the relational model,
and after filtering through the 'sieve', one still must evaluate the
individual tuples to generate the resulting output.
Is my description accurate now?
Bob, thank you very much for providing a practical example of why no
magic transformation exists.
Yes, I can easily see how one can select the longest stick of spaghetti
and to contrive the analogy a bit, we can slide it through a plate set
at a certain height and thus filter out the taller spaghetti, then one
more plate at a lower height and discard those that passed the 2nd
filter leaving us with spaghetti between x and y. In other words, range
seek. Going back to the relvar, it only applies to one relvar (not to
say one can't use the same filters for two different relvar under
certain condition as Bob provided in his example of customer and state
relations being filtered by 'AK' first prior to being joined together.
What I was really interested in learning whether join (or any operations
involving more than one relation) could be processed in similar manner
without necessitating loops across a set of tuples within a relation to
determine what tuples from other relation satisfy the condition. Based
on what I'm picking up (and I hope I'm following along) such thing is
unavoidable.
> If relational algebra were presented properly you would find that
> first of all it is an algebra schema,
> since every set of types produces a different algebra;
> that those algebras are "many-sorted", the sorts being relations,
> attributes, sometimes attribute lists, sometimes quoted-expressions,
> and the types;
> and that you would have additional operators like
> ONE_BY_ONE_RELATION_type(ATTRIBUTE, value) returning a relation
> and VALUE_type(relation) returning a value of the type.
> (You don't need to have explicit tuples.)
Indeed. Date basically argued that SQL DBMS was lacking in proper type &
operator support. A part of reason of why I'm asking those questions is
to help me understand how did SQL end up the way it is right now and why
relational model wasn't more closely adhered to. I had latched on the
notion that it had to do with that relation model (as far as I know
about it via Date's book) didn't say much about how to optimize the
evaluating of given expression so it's computationally cheaper to
process the given expression.
Indeed, Date gave a good discussion about why neither made sense in
relations.
> but I don't know that SQL is *that* far. The relational algebra doesn't
> cover the most common four ways to implement an inner join; implementation
> isn't its concern. But choice of looping, merge joins, hash joins, etc,
> definitely affects query performance.
Yes, I may have had picked up mistaken impression from the same book
thinking that relational model had something to say about optimizing the
queries. It doesn't look to be the case here, and would definitely
explain why SQL is fairly divorced from relational theory.
But it's not just the relational theory. Date provided an argument for a
need of proper type & operator support as I told other poster in
previous reply. Off the top of my head, not all vendors supports
user-defined types (probably only handful) and of those that do, they
don't provide a means of defining an operator for those types. Granted,
they could be replicated via use of stored procedures/functions,
middleware, or whatever. But what I think I had understood was that the
"relational-object impedance" is fictional; impedance only exists
because vendors didn't properly implement such things which would have
had disposed of the problems of interacting with relations via OOP.
I think Date also argued that SQL's structure of SELECT - FROM - WHERE -
... is fairly rigid and may further complicate the expressiveness of
queries and as a consequence, it's possible to write several different
logically equivalent queries and have difference performance
ramifications; IOW, SQL doesn't provide enough of independence from
implementation. Whenever anyone has to port an application using say,
SQL Server to Oracle, I would definitely expect some rewrites will be in
order even if most of code was written in standard SQL and didn't rely
on vendor-specific features/implementations.
One more reason also has to do with SQL's seemingly inconsistencies (to
be specific here, SQL as implemented by different vendors, rather than
the SQL standard). A prime example is the case of an SQL table where we
have single column with UNIQUE constraint. Let's say it contains five
rows consisting of { 5, 3, 4, 2, 1 } and we want to run this query:
UPDATE t SET col = col + 1;
Depending on which vendors' DBMS you use to run the query, it may suceed
or fail. In Relational theory, it should succeed because the constraints
should be evaluated on a per-statement basis, not per-row statement.
This is a trivial example, but several more inconsistency does exists
and I found that quite bizarre. I'm not a mathematician by trade but I
do believe mathematics has rigid definitions so there's no "alternate
version" of math, yet nobody seems to be adhering to the relational
theory closely enough, as Date tries to argue in his book. Thus, my
questions on this newsgroups.
> Also you have to appreciate that SQL is essentially a verbal form of
> relational calculus. That gets translated to algebra, and progressively
> refined towards physical operations in the formulation of a query plan.
> As a calculus, it isn't going to look algebraic, as, for example, QUEL did.
Interesting. IIRC, Date basically claims that SQL is a bit of relational
algebra, a bit of relational calculus and a bit of neither, though I
also recall that one objective of SQL was to make it easy to express
without resorting to mathematics as its preprocessors required.
> I wish more of his work was still available - I heard him speak about
> it - but he's still around, see <http://www.kenj.com.au/>. He doesn't
> live that far from me, and I've thought about looking him up. Perhaps
> he has some archived material he'd email you?
Thanks. I will look at the site.
> After MUSBUS, he continued the same kind of work at Pyramid Computers,
> with their fantastic multi-processor MIPS machines. They had a policy
> of being RDBMS agnostic, but liaising with the vendors closely on
> performance analysis so their hardware ran every RDBMS better than
> competitive h/w. I wish they were still around.
Fascinating.
The thing to keep in mind is, in the end, one has to implement the dbms
on physical hardware. With current computational models and storage
devices, iteration is unavoidable because everything is based on some
linear address space or another. That doesn't mean we won't ever have
devices capable of set level operation. It just means we don't today.
I am reminded of a remark in the Dragon Book (Aho et al.) about
optimization. After giving methods for a variety of different compiler
optimizations, the authors remarked that often algorithmic replacement
achieves far more improvement than all of the available optimizations
combined. e.g. replacing a bubble sort with a quicksort. They go on to
remark that algorithmic replacement remains an unachievable goal.
Using a high-level, declarative, algebraic specification like the RM
changes the task from algorithmic replacement to algorithm discovery or
generation, which is a much easier task.
Yep. Minor correction: the 'sieve' wasn't provided by the relational
algebra. Relational model, however, does refer to tuples, so, from
definition of join in set notation, for example, one may infer nested
loops algorithm.
At the risk of confusing you more: "magic optimization" is actually a
legitimate technical term.
I could be wrong, but I suspect looking for "magic sets rewriting" will
bring up more relevant results than looking for "magic optimization".
Actually, SQL itself has relatively little to say about query optimization.
When code jockeys set out to learn SQL, the hardest thing to get them to
learn is to refrain from coding in such a way as to gove implicit "hints" to
the optimizer in the way they express their queries.
Getting programmers to think in terms of "what" rather than "how" has been a
continuing struggle since the 1950s. When FORTRAN was first launched on the
scene, a whole generation of 2GL programmers made it their goal to write
their FORTRAN in such a way that the compiler would discover an efficient
machine language equivalent. That might be a noble goal in and of itself,
but when it interferes with writing coherent and flexible code, it's very
counterprodictive. When virtual memory came on the scene, another whole
generation spent more time than they should have anticipating program memory
references so as to minimize paging. There is a small class of programming
problems where this kind of thing makes sense. There is a much larger class
where it does not.
Matters were made even worse when programmers learned how to manipulate the
"rules based optimizer" in Oracle, before the days when it had a cost based
optimizer. The "cool" programmers learned how to tell Oracle how to do its
job by doing such things as putting a certain table last in the FROM clause.
All of this distracted from coming up with a good logical design and clear
coherent code. And "hints" were an even deeper descent into "how" rather
than "what".
Not only that, but there are a lot of junior SQL jockeys who have a
completely clouded idea of how a cost based optimizer works, and try to
"help" the optimizer in ways that are useless or even harmful. I can't tell
you the number of times I've been approached by an SQL neophyte whose query
gives incoreect results, and all you have to do is change "SELECT" to
"SELECT DISTINCT". They'll give a negative response, because they've been
told that SELECT DISTINCT runs real slow, so they avoid it. Their method is
to first come up with a solution that runs fast, even if it gives wrong
answers, and then fix it so that it gives right anwers. It's usually a
better plan to come up with a solution that gives right answers, and then
fix it so that it runs fast as well. And in some cases, SELECT DISTINCT
actaully runs faster than SELECT.
In another case, I saw where a program specified in a WHERE clause, the
following: "STATE = progamvariablestate AND CITY = programvariablecity".
When I asked the programmer why he didn't specify the country, he told me
that the country was always USA and he wanted to keep things simple for the
DBMS. After looking at the indexes, I was able to speed up the query by
changing the criterion to "COUNTRY = 'USA' AND STATE = progamvariablestate
AND CITY = programvariablecity". The delay time dropped from about 6
minutes to less than 2 seconds. (for extra credit: can you guess what I saw
in the indexes?)
My point in all of this ramble is that telling the DBMS what to do and
telling it how to do it are two separate ways of coding.
Tegiri, thanks. I presume this is only because transformation isn't the
only process that can be employed in optimizing the given
query/expression, yes?
Bob Badour wrote:
> I could be wrong, but I suspect looking for "magic sets rewriting" will
> bring up more relevant results than looking for "magic optimization".
Thank you, again, Bob.
Thank you for sharing, that was an enlightening read. Interestingly
enough, it happened that someone asked for an 'exclusionary join', and
the answer we could only offer was an awkward union & frustrated joins.
I think there's something to be said about sticking to expressing query
as an expression rather than as statement. That would be far more easier
to parse & optimize, I'd think.
You're preaching to the choir here. I do cringe a little when I see all
those fancy "hints" which I've never really seen the occasion to use.
I'm sure it may make sense in a exceptional or specialized case but
that's all it is. Most queries are best well left enough to the engine.
> Getting programmers to think in terms of "what" rather than "how" has been a
> continuing struggle since the 1950s.
Fascinating; it shows that more things change, the more they stay same.
> Matters were made even worse when programmers learned how to manipulate the
> "rules based optimizer" in Oracle, before the days when it had a cost based
> optimizer. The "cool" programmers learned how to tell Oracle how to do its
> job by doing such things as putting a certain table last in the FROM clause.
> All of this distracted from coming up with a good logical design and clear
> coherent code. And "hints" were an even deeper descent into "how" rather
> than "what".
Well, as I agreed with you, one shouldn't really be fiddling with the
engine, but as I remarked in an earlier post if we were to port a
complex database project from say, SQL Server to Oracle, and even if we
didn't use any vendor-specific features, several SQL rewriting may be
warranted simply because of performance differences due to the
differences of how a SQL statement is parsed & optimized between two
engines.
> I can't tell you the number of times I've been approached by an SQL neophyte whose query
> gives incoreect results, and all you have to do is change "SELECT" to
> "SELECT DISTINCT". They'll give a negative response, because they've been
> told that SELECT DISTINCT runs real slow, so they avoid it. Their method is
> to first come up with a solution that runs fast, even if it gives wrong
> answers, and then fix it so that it gives right anwers. It's usually a
> better plan to come up with a solution that gives right answers, and then
> fix it so that it runs fast as well. And in some cases, SELECT DISTINCT
> actaully runs faster than SELECT.
A while back, we discussed the problem of non-matches, of which the
traditional answer in SQL is to use a frustrated join to get the
nonmatches. However, I found a case where it was actually faster to
execute the query using IN() & correlated subquery (which had a
notorious reputation for being a drag) than the frustrated join when
there were very small or no chance of actual match being found and
excluded. Note, though, that this is still implementation so this may
not hold for all other DBMS's implementation.
Thus, this kind of situation is quite common, and I do think the SQL
literature is filled with many of "myths" which are in fact based on a
grain of truth, but there's so many exceptions and "but..." attached to
make it hard to apply more or less unconditionally. This may be probably
why I went and picked up a book by Date; to learn a bit about relation
theory and see why SQL basically ended up the way it is nowadays.
> My point in all of this ramble is that telling the DBMS what to do and
> telling it how to do it are two separate ways of coding.
It shouldn't be forgotten that all abstractions are leaky and SQL is no
exception here. It's unfortunate, yes, as the objective was to provide
independence from the specifics of performance in retriving a desired
set regardless of how it was expressed but the fact is that sometime
optimizer fails to recognize a better expression. Those leaks, I
suspect, are the motivation for providing hints, even though for most
cases people would do well to shun them until it was absolutely necessary.
Sometime I think that a bit more tighter adherence to the relation
theory may help to stem in leaks but even so, it won't plug all leaks.
No abstraction ever devised has been not leaky but we certainly can make
it less leaky than others.
I'm not going to take exception to all attempts to optimize. I do take
exception to a premature preoccupation with performance optimization, to the
detriment of all other measures of "goodness". I didn't mean to preach to
the chior, here. I just wanted to point out what this issue looks like from
the trenches.
Well, sometime we have to reinforce this. I've been guilty of being
wow-ed by the latest and newest whizzbang which after a closer
examination turned out to leave us even worse off. I do think vendors
are doing us (the programmers, developers & whoever works with
databases) a disservice when they enable the bad behaviors by providing
hints (as well as other "features") as opposed to demanding more
rigorous discipline over the design and writing expressions. But I don't
think demanding discipline sell products, unfortunately, at least not
without redesigning in such way that it's more intuitive to exercise
discipline than going at it in an ad hoc fashion.
Unfortunately MySQL is still teaching the new generation that the order of
joins actually matters.
> Getting programmers to think in terms of "what" rather than "how" has been a
> continuing struggle since the 1950s.
Now you're singing my song. This is one of the most fundamental problems
of the industry, going far beyond just the database world.
I've had to use hints in MSSQL (2000) in order to avoid a nasty optimiser
bug where it ignores the obvious index that can make a join work. This was
with a DELETE FROM a WHERE NOT EXISTS(...) (an anti-semi-join).
The sellers know the buyers are ignorant, not unlike many other fields.
Unlike in most other fields that are so expensive to participate in,
the ironic aspect in IT is that the "free software" developers usually
mimic what the commercial developers do (and the bulk of them aspire to
be commericlal). For the remaining few virtue is the only remaining
reward and it's a waste of time to complain about what the rest do.
I haven't studied any of the work on formalising the RM as an
algebraic structure. My understanding is that an algebraic structure
typically consists of a set of abstract elements closed under one or
more operators and satisfying some axioms.
I would have thought that set theory itself cannot be regarded as an
algebraic structure - because it is not possible to form the set of
all sets (by Russell). Operators like 'union', 'intersection' and
'element-of' don't have a domain and therefore are not functions.
Wouldn't the RM suffer from the same limitation (well at least an
untyped version of the RM)?
Sorry if I'm talking about things outside my depth.
No, that's not why. I don't actually see any reason to exclude
sets from the study of algebraic structures. Perhaps the
one reason would be that sets are generally at or near the
foundational bottom of the mathematical hierarchy, and
we risk circularity if we treat them as algebraic structures.
Normally the "is-an-element-of" predicate is the only operator
and it is left undefined.
Russel's paradox is actually not difficult to deal with and only
comes in to play if the system admits unrestricted set
comprehension. That is, if you allow the construction of
sets via a description of their members. A low-cost alternative
to unrestricted comprehension is separation: start with a set,
and construct a new set that contains those members of the
original set that satisfy some predicate. (Instead of constructing
a set that contains all sets that satisfy some criteria.)
> Operators like 'union', 'intersection' and
> 'element-of' don't have a domain and therefore are not functions.
But they do have a domain: sets. In set theories such as
ZFC, there aren't any members of the domain of discourse
that aren't sets.
> Wouldn't the RM suffer from the same limitation (well at least an
> untyped version of the RM)?
I don't think so. It is entirely possible to have a relational theory
that only admits the existence of relations.
I don't generally like the approach of having a set theory that
only admits sets, but it's actually the more popular approach.
Marshall
Yes, an axiomatic system like ZFC is believed to be free from
contradiction because it doesn't allow unrestricted comprehension.
However assuming an algebraic structure by definition involves a *set*
of abstract elements, then I can't see how ZFC can itself be regarded
as an algebraic structure.
> > Operators like 'union', 'intersection' and
> > 'element-of' don't have a domain and therefore are not functions.
>
> But they do have a domain: sets. In set theories such as
> ZFC, there aren't any members of the domain of discourse
> that aren't sets.
I'll try and show my reasoning as explicitly as possible...
Claim: Under ZFC there is no set which is defined as the
set of all sets.
Proof: Suppose there exists U such that for all x, x is an
element of U. By the axiom of separation one can take
the subset R of U
R = { x in U : not (x in x) }
It then follows that if R is an element of R then it
can't be an element of R, and if R is not an element of
R then it must be an element of R. Either way this is
a contradiction which proves there is no such U.
Claim: The intersection operator is not a binary function
Proof: Every function has a domain which is by definition a
set. If the intersection operator is a binary function
then its domain is UxU where U is the set of all sets,
but U is not a set under ZFC. It follows that the
intersection operator is not a binary function.
> > Wouldn't the RM suffer from the same limitation (well at least an
> > untyped version of the RM)?
>
> I don't think so. It is entirely possible to have a relational theory
> that only admits the existence of relations.
>
> I don't generally like the approach of having a set theory that
> only admits sets, but it's actually the more popular approach.
I suspect there is no such thing as the set of all relations.
paul c wrote:
> The sellers know the buyers are ignorant, not unlike many other fields.
> Unlike in most other fields that are so expensive to participate in, the ironic
> aspect in IT is that the "free software" developers usually mimic what the commercial
> developers do (and the bulk of them aspire to be commericlal). For the
> remaining few virtue is the only remaining reward and it's a waste of time to
> complain about what the rest do.
I don't think it's that ironic, paul. The trouble with crowdsourcing is
that it's great at showing what *is* the common practice but total crap
at showing what *ought* to be the coding practice. Look at OpenOffice
and Ubuntu. They amounts to something like Microsoft Me-too! Then
contrast with Mac OS X and iWork, which both are closed source though
based on open source components, yet they work very well because
developers didn't get buried into decision by committee that plauges
FOSS at large.
My long-standing song is that it's easy to write a program, but entirely
something else to write a program that promote good behaviors on the
part of users. Failing to do that just encourages the users to break the
rules so they can do it their way, and I think that is somewhat true of
any SQL products, commercial or not. Users have little incentive to
adopt "good" behavior by letting optimizer doing their job. The blame
lies squarely upon the feet of vendors (and to extent the standard
committee as well)
On an unrelated matter, I think there is something to be said about
MySQL's interesting implementation. Many people recoil in horrors
regarding its apparent disregard for transactions and foreign key
support (at least for tables not using InnoDB engine), but to me this
makes perfect sense: If it's not needed, then the developer should be
able to choose to not need it. I'm not going to say that a financial
application should choose MyISAM tables to store its essential
can't-afford-to-lose-it data, but for applications where we just don't
give a hoot if we lose some data or simply don't need transactions, then
it makes sense to gain performance in that area.
As a practical example, I worked on a project for a client, having had
joined in midway so they already had decided on the specifications.
Basically, they used SQL Server as a reporting server, loading data from
Oracle DSS. The users just needs to aggregate data upon filters of their
choosing and generate reports; no updates except to dump the data and
load new dataset from Oracle on a regular interval. Did they really need
the transactional feature? Foreign key constraint? No, not really. Did
they want their reports fast, fast? Yes. So, They basically paid more
money for less performance when they could had use MySQL. (Granted, one
could tweak MS SQL server to ease up on various overhead but it can't be
turned off entirely and that's more work anyhow whereas MySQL is just
that out of the box)
But that's nothing to do with RM itself; it's just one more
implementation. From a pragmatic viewpoint, one can see why it may
*sometimes* make sense to leave RM/SQL while keeping the interface (e.g.
SQL language).
>> Getting programmers to think in terms of "what" rather than "how" has
>> been a continuing struggle since the 1950s.
>
> Now you're singing my song. This is one of the most fundamental problems
> of the industry, going far beyond just the database world.
>
> I've had to use hints in MSSQL (2000) in order to avoid a nasty optimiser
> bug where it ignores the obvious index that can make a join work. This was
> with a DELETE FROM a WHERE NOT EXISTS(...) (an anti-semi-join).
>
> --
I once ran across an optimizer that didn't realize that "group by A, B" and
"group by B, A" are equivalent, and thereby failed to use a composite index.
That optimizer was otherwise very, very good. I seldom worried about speed.
I'm not following this. A good optimizer does one job, and lets the
database designer/developer focus on a different job.
If "good behavior" is simple yet sound design in both database definitions
and database transactions, then I would claim that a good optimizer does not
discourage good behavior as much as a bad optimizer does.
Well, think about it. If the optimizer was flawed in such way that one
had to put tables in a certain order in the FROM clause or use hints,
then this encourages bad behavior because now the developers have to
fiddle with the optimizer, something they shouldn't be doing all the
along. A optimizer that was good would have made those behavior irrelevant.
Of course, the real culprit would be the vendors because they deigned to
implement in more or less ad hoc manners; no vendors are 100% SQL
standard compliant and SQL standard itself was suspect to start with.
Instead of addressing those more major flaws they chose to implicitly
encourage more bad behavior by providing hints or attempt to expand the
optimizer's scope, a job made much more harder due to nulls, duplicates
among other things.
To provide an simple illustration of how good implementation can promote
good behavior, consider whitespace in programming languages. It's a
common convention that one indents code blocks to make it readable. In
most languages, whitespace have no semantic significance so they were
free to indent or not indent. There are indeed conventions over how one
should indent the code but there will be codes here and there that may
not indent consistently. Python changed this by making the whitespace
semantically significant and thus requires indenting for a given block.
Thus it made the indent and consequently the code layout far more
consistent without requiring any thought from those who write code. This
works far much better than developing an IDE that auto-indent code for
people. Providing hints is just like providing auto-indents; it may help
fix the symptoms but not the underlying problem.
May I suggest that the problem is the set theory foundation -- the set
membership operation? After all, one can't possibly think of the least
pretty operation from algebraic perspective. Contrast it to
intersection, union, and (relative) complement -- which form Boolean
algebra on a powerset.
In RM the level of curly brakets nesting never goes higher than two,
so set theory paradoxes are irrelevant.
> Claim: The intersection operator is not a binary function
Isn't a binary operation on powerset boolean algebra?
Let R(T) be the set of all relations where every attribute has domain
T. For any set T, R(T) exists in ZFC (largely by virtue of the axiom
of power set).
Let X be the union of R(T) over all possible T. The definition of X
involves unrestricted comprehension and I don't think it exists under
ZFC.
It seems to me one can study R(T) as an algebraic structure - say with
the relational lattice operators, and to the extent that T is not
specified (but assumed to be a union type big enough to hold all
values of interest) it actually can be regarded as an untyped
treatment of the RM.
Does this make sense?
I'm not sure how RVAs fit into that picture. I think recursive types
should be allowed, but any given relation (value) only involves finite
nesting.
> > Claim: The intersection operator is not a binary function
>
> Isn't a binary operation on powerset boolean algebra?
Yes.
As I see it, if one has an operator defined on a proper class, one can
take a restriction to a domain which is a set (i.e. a subset of the
proper class) to make the operator into a function.
Here is a discussion of "proper class":