My question is rooted in T-SQL, SQL Server environment, but its scope is not confined to this technology. I am working on a database with a quite complex business logic, with existing views, stored procedures and new ones to be designed. By means of comparisons of different queries or part of them, I have a strong feeling that there are sections performing the same job with a different arrangement, but of course to refactor the whole mess I need something more that a feeling; so I am trying to determine a way to demonstrate that two statements are equivalent.
An obvious but weak response could be to ascertain that the two queries A and B produce the same recordset: if A is a subset of B and B is a subset of A, they are the same recordset; but I am not sure that this is a good idea because, of course, a recordset is not a query, the results could depend on data and specific parameter values. My questions is: there is a method to prove the equivalence of two different queries? I would say yes, because the optimization performed by the database should works on this. Someone could provide me some pointer to documentation or books digging in this? If there is no general method to prove the equivalence, there is some smart approach based on regression testing performed according to some effective heuristic that does the job?
Edited later: in case, reverse engineering the queries (by hand?) by means of relational algebra, could be a superior method to assess the query equivalence instead of using other queries and / or the computer? There are automated tools helping in performing this "reverse engineering", in case?
You probably can't prove it, since the problem seems to be NP-complete; check this SO question on query equivalence (that one is about Oracle, but there are a couple of answers / links that should be relevant for you).
You'll need to implement some "canonical query plan" generator for this (an "optimal query plan" as generated by the DBMS can be nondeterministic). In most cases, using alphabetical ordering of terms and tables as a tie-breaker will get you there.
Analyze and transform equivalent relational expressions.
Here, we shall talk about generating minimal equivalent expressions. To analyze equivalent expression, listed are a set of equivalence rules. These generate equivalent expressions for a query written in relational algebra. To optimize a query, we must convert the query into its equivalent form as long as an equivalence rule is satisfied.
Cost-Based Optimization: In cost-based optimization, the query optimizer estimates the cost of executing each possible query execution plan and selects the plan with the lowest cost. The cost of a query execution plan is usually measured in terms of the number of disk I/O operations or CPU cycles required to execute the plan.
Plan Space Exploration: The query optimizer explores the plan space, which is the set of all possible query execution plans, to find the optimal plan. This can be a complex and computationally expensive process, especially for complex queries involving multiple tables and join operations.
Query Rewriting: The query optimizer may rewrite the original query to create an equivalent query with a more efficient execution plan. For example, the optimizer may reorder the join operations or apply filter conditions earlier in the query execution plan to reduce the amount of data that needs to be processed.
Statistics: The query optimizer relies on statistics about the data in the tables being queried to estimate the cost of different query execution plans. These statistics may include information about the number of rows in each table, the distribution of values in each column, and the presence of indexes or constraints.
Caching: The query optimizer may cache the results of commonly executed queries to improve performance. This can reduce the need to execute the same query multiple times, especially in applications with frequent user queries.
Improved Query Performance: The main advantage of query optimization is that it can significantly improve the performance of queries. By selecting the most efficient query plan, the RDBMS can execute the query in the shortest possible time.
Cost Reduction: The optimization process reduces the cost of query execution by minimizing the use of system resources, such as CPU and memory, while still delivering the same results.
Reduced Complexity: Query optimization makes the query execution process more efficient and reduces the complexity of the system. By using a streamlined approach, the RDBMS can improve its overall performance.
Increased Overhead: Query optimization requires additional processing time and system resources. This overhead may be significant for large databases and can impact system performance.
Limited Scope: Query optimization may not work for all queries, especially those that are ad-hoc and not well-defined. In such cases, the RDBMS may not be able to optimize the query, leading to suboptimal performance.
Algorithmic Complexity: The process of query optimization itself can be complex, especially for large and complex queries. This can result in a higher computational cost and may require specialized expertise to implement.
As I had a specific query in mind when I was doing this, I ended up doing as @dougman suggested, over about 10% of rows the tables concerned and comparing the results, ensuring there was no out of place results.
The best you can do is compare the 2 query outputs based on a given set of inputs looking for any differences. To say that they will always return the same results for all inputs really depends on the data.
1) Real equivalency proof with Cosette:
Cosette checks (with a proof) if 2 SQL query's are equivalent and counter examples when not equivalent. It's the only way to be absolutely sure, well almost ;) You can even throw in 2 query's on their website and check (formal) equivalence right away.
that should give you an empty set. If it does, then the queries do return the same sets.if it is not empty, then the queries are different in some respect, and the result set shows you the rows that are different.
This will do the trick. If this query returns zero rows the two queries are returning the same results. As a bonus, it runs as a single query, so you don't have to worry about setting the isolation level so that the data doesn't change between two queries.
The DBMS vendors have been working on this for a very, very long time. As Rik said, it's probably an intractable problem, but I don't think any formal analysis on the NP-completeness of the problem space has been done.
However, your best bet is to leverage your DBMS as much as possible. All DBMS systems translate SQL into some sort of query plan. You can use this query plan, which is an abstracted version of the query, as a good starting point (the DBMS will do LOTS of optimization, flattening queries into more workable models).
In Oracle (depending on your version), you can tell the optimizer to switch from the cost based analyzer to the deterministic rule based analyzer (this will simplify plan analysis) with a SQL hint, e.g.
The rule-based optimizer has been deprecated since 8i but it still hangs around even thru 10g (I don't know 'bout 11). However, the rule-based analyzer is much less sophisticated: the error rate potentially is much higher.
For further reading of a more generic nature, IBM has been fairly prolific with their query-optimization patents. This one here on a method for converting SQL to an "abstract plan" is a good starting point:
Perhaps you could draw (by hand) out your query and the results using Venn Diagrams, and see if they produce the same diagram. Venn diagrams are good for representing sets of data, and SQL queries work on sets of data. Drawing out a Venn Diagram might help you to visualize if 2 queries are functionally equivalent.
CAREFUL! Functional "equivalence" is often based on the data, and you may "prove" equivalence of 2 queries by comparing results for many cases and still be wrong once the data changes in a certain way.
Massive level's of testing aren't that hard to cobble together for a SQL query. Write a proc which will iterate around a large/complete set of possible paramenters, and call each query with each set of params, and write the outputs to respective tables. Compare the two tables and there you have it.
The first step of the optimizer says to implement such expressions that are logically equivalent to the given expression. For implementing such a step, we use the equivalence rule that describes the method to transform the generated expression into a logically equivalent expression.
Although there are different ways through which we can express a query, with different costs. But for expressing a query in an efficient manner, we will learn to create alternative as well as equivalent expressions of the given expression, instead of working with the given expression. Two relational-algebra expressions are equivalent if both the expressions produce the same set of tuples on each legal database instance. A legal database instance refers to that database system which satisfies all the integrity constraints specified in the database schema. However, the sequence of the generated tuples may vary in both expressions, but they are considered equivalent until they produce the same tuples set.
The equivalence rule says that expressions of two forms are the same or equivalent because both expressions produce the same outputs on any legal database instance. It means that we can possibly replace the expression of the first form with that of the second form and replace the expression of the second form with an expression of the first form. Thus, the optimizer of the query-evaluation plan uses such an equivalence rule or method for transforming expressions into the logically equivalent one.
4a15465005