Future plans.

0 views

Skip to first unread message

Dhruv Matani

unread,

Jun 11, 2006, 8:01:52 AM6/11/06

to The Distibuted DataBase

Hello everyone.
I'm pleased to announce that currently TDDB has undergone many
bug-fixes & preformance improvements(3x in some places). Future work
on it is as described below. I realize that we won't be able to work
on it full-time like before, but I guess we should make a conscious
effort. You can take it that I'll be very supportive of it if anyone
decides to get involved full-time.

[1] GROUP BY support: Grouping will be implemented using a hash map,
and a custom hash function will hash(depending on the user input)
tuples into the correct bucket.

[2] Aggregate support: Apart from the normal Aggregate functions[SUM,
MIN, MAX, AVG, and COUNT], I would like to support MEDIAN & MODE. If
anyone has other ideas, please feel free to speak it out.

[3] LIKE clause Support: This is more under Sandy's control, and has
nothing to do with the Engine per-se.

[4] JOIN & Selection optimizations: I have realized that it would be a
big performance boost to have the parser have functions which give a
hint to the parser as to what "kind" of WHERE clause has been passed
by the user. The main optimizations will come when:
[a] There is only 1 condition.
[b] Constant evaluation of all nodes(either TRUE or FALSE).
[c] All conditions connected by AND. This is an is_simple() WHERE
clause, and JOINS involving simple WHERE clauses can be highly
optimized if the JOIN is of type Equijoin, and contains comparisons
with constant terms.

The parser will have to perform constant propagation, and transitive
propagation.

eg:

[A] SELECT * FROM tab1 as T1, tab2 as T2 where T1.x=23 AND T1.x=T2.x;
can be transformed(without loss of generality) to:
SELECT * FROM tab1 as T1, tab2 as T2 where T1.x=23 AND T2.x=23;
Using Nested-loop join for actual JOIN processing will yield 0 wasted tuples.

[B] SELECT * FROM tab1 as T1, tab2 as T2, tab3 as T3 where T1.x=23 AND
T1.x=T2.x AND T2.x=T3.x;
can be transformed(without loss of generality) to:
SELECT * FROM tab1 as T1, tab2 as T2, tab3 as T3 where T1.x=23 AND
T1.x=T2.x AND T2.x=T3.x AND T1.x=T3.x;
applying the rule above:
SELECT * FROM tab1 as T1, tab2 as T2, tab3 as T3 where T1.x=23 AND
T2.x=23 AND T3.x=23;

If we do constant propagation to the lowest tuple reading function of
the disk reader, then again, a Nested-loop JOIN will suffice, and will
result in 0 wasted Rows.

The next big issue is supporting non-field projections. For each field
in the projection list, we can create a out_type, which is a type
which will accept a Tuple, and produce the output as the result.

So, suppose we have:
SELECT a, b FROM tab1;
Then, there will be 2 out types which both return a & b respectively.

SELECT a+b FROM tab1;
WIll contain just 1 out type, which returns the value a+b given a Tuple.

Aggregates can be implemented as out-types in themselves:
SELECT max(a+b) FROM tab1;
Will again contain an out type, but that out_type which corresponds to
an aggregate will take as input a complete relation. Each tuple from
this relation will be passed to the out_type of (a+b), and then
processing will continue.

We would need to make a mangling function to get the mangled name for
each field/non-field projection with the new infrastructure.

--
-Dhruv Matani.
http://www.geocities.com/dhruvbird/

"The biggest room is the room for improvement."
-- Navjot Singh Siddhu.

Reply all

Reply to author

Forward

0 new messages