Architecture proposal for non-field projections, aggregates & GROUP BY.


Dhruv Matani

Jul 5, 2006, 6:05:57 AM
to The Distributed DataBase
Hello,
Here is the proposed methodology for implementing the above-mentioned
features for TDDB.
Currently, the projection list is given to the engine by the parser
in the form of a vector<string> containing the names of the fields.
This severely limits what the SELECT can project. The proposal
described herein makes provision for non-field (arithmetic) expressions
and aggregates in the projection list.

[1] Pure field projection:
SELECT f1, f2 FROM tab1;

[2] Non-field projection (involving only arithmetic ops.):
SELECT f1+f2, f3 FROM tab1;

[3] Aggregates:
SELECT MAX(f1+f2)+f1, f3 FROM tab1;

[4] Pure Aggregation:
SELECT MAX(f1+f2) FROM tab1;
This will return ONLY 1 Row in the Result Set.

Thus, we see that there is a special case when there is ONLY an
Aggregate in the projection list.

The proposal seeks to have the parser send the engine a
vector<projected_element_t> in addition to a vector<string> for every
SELECT query. The structure is defined as follows:

struct projected_element_t
{
    /* Implementation. */

    tddb::base_type *
    value(tddb::result_row_type& row, tddb::tuple_type agg_list);
};

Here, row is the Row obtained after processing the WHERE clause of the query,
and agg_list is the list (vector) of aggregates whose values are stored in the tuple.
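
To make this concrete, here is a rough sketch of what value() could look
like if each projected_element_t carries a small expression tree built by
the parser. This is only a sketch: the real tddb::base_type,
tddb::result_row_type and tddb::tuple_type classes are not shown in this
post, so plain doubles and std::vectors stand in for them, and names such
as expr_node, row_t and agg_tuple_t are purely illustrative.

#include <cstdio>
#include <memory>
#include <vector>

typedef std::vector<double> row_t;        /* stand-in for tddb::result_row_type */
typedef std::vector<double> agg_tuple_t;  /* stand-in for the aggregate tuple   */

/* One node of a projected expression: a field reference, a constant, a
 * reference to a pre-computed aggregate, or an arithmetic op on two
 * sub-expressions. */
struct expr_node
{
    enum kind_t { FIELD, CONSTANT, AGG_REF, ADD } kind;
    int index;        /* field index (FIELD) or aggregate index (AGG_REF) */
    double constant;  /* used only by CONSTANT nodes                      */
    std::unique_ptr<expr_node> left, right;   /* used only by ADD nodes   */

    double eval(const row_t& row, const agg_tuple_t& agg_result) const
    {
        switch (kind) {
        case FIELD:    return row[index];
        case CONSTANT: return constant;
        case AGG_REF:  return agg_result[index];
        case ADD:      return left->eval(row, agg_result) +
                              right->eval(row, agg_result);
        }
        return 0.0;
    }
};

/* Sketch of projected_element_t: value() simply walks the stored tree with
 * the current row and the tuple of pre-computed aggregate values. */
struct projected_element_t
{
    std::unique_ptr<expr_node> root;

    double value(const row_t& row, const agg_tuple_t& agg_result) const
    {
        return root->eval(row, agg_result);
    }
};

int main()
{
    /* Build the tree for "MAX(id) + 4", assuming agg_result[0] = MAX(id). */
    projected_element_t p;
    p.root.reset(new expr_node());
    p.root->kind = expr_node::ADD;
    p.root->left.reset(new expr_node());
    p.root->left->kind  = expr_node::AGG_REF;
    p.root->left->index = 0;
    p.root->right.reset(new expr_node());
    p.root->right->kind     = expr_node::CONSTANT;
    p.root->right->constant = 4.0;

    row_t row;                        /* this projection reads no fields   */
    agg_tuple_t agg_result(1, 7.0);   /* pretend MAX(id) came out to be 7  */
    std::printf("%g\n", p.value(row, agg_result));   /* prints 11 */
    return 0;
}

The idea is that the parser turns MAX(id)+4 into an ADD node whose left
child refers to the pre-computed MAX(id) aggregate and whose right child is
the constant 4; value() then just walks that tree for every row.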

There needs to be one more pre-processing step wherein the values of
all the aggregates involved are computed.

So, if we have a select query as such:
SELECT MAX(id)+4, MIN(id+f1)+AVG(f1+f2), roll FROM tab1;

Then there will be another vector<aggregate_t> which has all the
aggregates involved.

struct aggregate_t
{
    int agg_type;
    tddb::sql_term agg_in;
};

So, there would be 3 entries in the vector<> for the above query.
For MIN(id+f1), the agg_type member will take the value MIN (or
something like that), and agg_in will represent the expression
(id+f1). The engine will compute the aggregates and store them in a
tuple_type variable (say agg_result). Every time the projection needs
to be performed, the value() function will be called with the Row and
the agg_result variable as parameters. It can then use the value(s) of
the aggregates already computed and produce the output. The engine will
store the result in a result set and send it back to the client.
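
A rough sketch of that aggregate pre-processing pass follows, under the
same simplifying assumptions (rows as vectors of doubles). Since
tddb::sql_term is not shown here, agg_in is modelled as a plain callable
that evaluates the expression inside the aggregate against one row; the
names agg_type_t and compute_aggregates are illustrative only.

#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

typedef std::vector<double> row_t;

enum agg_type_t { AGG_MIN, AGG_MAX, AGG_AVG };

/* Simplified aggregate_t: agg_in evaluates the expression inside the
 * aggregate (e.g. id+f1) for one row. */
struct aggregate_t
{
    agg_type_t agg_type;
    std::function<double(const row_t&)> agg_in;
};

/* One pass over the rows that survived the WHERE clause; produces one
 * value per aggregate -- the agg_result tuple used by value(). */
std::vector<double>
compute_aggregates(const std::vector<aggregate_t>& aggs,
                   const std::vector<row_t>& rows)
{
    std::vector<double> agg_result;
    for (std::size_t i = 0; i < aggs.size(); ++i) {
        double sum = 0.0;
        double mn  =  std::numeric_limits<double>::infinity();
        double mx  = -std::numeric_limits<double>::infinity();
        for (std::size_t r = 0; r < rows.size(); ++r) {
            const double v = aggs[i].agg_in(rows[r]);
            sum += v;
            if (v < mn) mn = v;
            if (v > mx) mx = v;
        }
        switch (aggs[i].agg_type) {
        case AGG_MIN: agg_result.push_back(mn); break;
        case AGG_MAX: agg_result.push_back(mx); break;
        case AGG_AVG: agg_result.push_back(rows.empty() ? 0.0
                                                        : sum / rows.size());
                      break;
        }
    }
    return agg_result;
}

int main()
{
    std::vector<row_t> rows;
    rows.push_back(row_t{3.0, 10.0});   /* fields: id, f1 */
    rows.push_back(row_t{7.0, 20.0});

    std::vector<aggregate_t> aggs;
    aggs.push_back(aggregate_t{AGG_MIN,
        [](const row_t& r) { return r[0] + r[1]; }});   /* MIN(id + f1) */

    std::vector<double> agg_result = compute_aggregates(aggs, rows);
    return agg_result[0] == 13.0 ? 0 : 1;   /* MIN(id+f1) = 3+10 = 13 */
}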

That is as far as projections are concerned.
GROUP BY is relatively simple. There will be a GROUPing object which
will need the indices (in the projection list) of all the elements involved.
It will compute a hash for each such element and insert each row into the
correct group.

SELECT MAX(id), f1+f2, f3 FROM tab1 GROUP BY 2, 3;

Here, the grouping algorithm will have to consult the
projected_element_t structure each time to get the value of
(f1+f2) for that row. It will compute the hash of that value, combine it
with the hash of projection 3 (f3), and thus obtain the final hash value.
The row will then be inserted into the correct bucket of the hash table.
After that, all aggregate processing will take place on a group-wise basis.
The rest is the same as above.
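
A small sketch of that grouping step, under the same simplifying
assumptions (rows as vectors of doubles, with a fixed field layout of id,
f1, f2, f3 used only for this example; group_key_hash, group_rows and the
other names are illustrative, not the actual TDDB classes):

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

typedef std::vector<double> row_t;
typedef std::vector<double> group_key_t;   /* values of the GROUP BY exprs */

/* Combine the hash of each grouping expression's value into one bucket hash. */
struct group_key_hash
{
    std::size_t operator()(const group_key_t& key) const
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < key.size(); ++i)
            h ^= std::hash<double>()(key[i]) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

typedef std::unordered_map<group_key_t, std::vector<row_t>, group_key_hash>
        group_table_t;

/* Mirrors "GROUP BY 2, 3" from the query above: the key for each row is
 * the value of projection 2 (f1+f2) and projection 3 (f3). */
group_table_t group_rows(const std::vector<row_t>& rows)
{
    group_table_t groups;
    for (std::size_t r = 0; r < rows.size(); ++r) {
        group_key_t key;
        key.push_back(rows[r][1] + rows[r][2]);   /* projection 2: f1+f2 */
        key.push_back(rows[r][3]);                /* projection 3: f3    */
        groups[key].push_back(rows[r]);           /* row goes into its bucket */
    }
    /* Aggregates such as MAX(id) would then be computed per bucket. */
    return groups;
}

int main()
{
    std::vector<row_t> rows;
    rows.push_back(row_t{1.0, 2.0, 3.0, 9.0});   /* id, f1, f2, f3          */
    rows.push_back(row_t{4.0, 1.0, 4.0, 9.0});   /* same group: 5, 9        */
    rows.push_back(row_t{5.0, 0.0, 0.0, 7.0});   /* a different group: 0, 7 */
    group_table_t groups = group_rows(rows);
    return groups.size() == 2 ? 0 : 1;           /* expect two groups */
}

The design point here is that the grouping key is the vector of values of
the GROUP BY expressions, so even if two per-row hashes collide by
accident, the rows still land in different groups, because the key itself
is compared for equality.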

I hope there are no glaring errors in the above proposal.


--
-Dhruv Matani.
http://www.geocities.com/dhruvbird/

"The biggest room is the room for improvement."
-- Navjot Singh Siddhu.
