org.voltdb.planner.Fragmentizer
Our code currently only works with two "fragments". One (the bottom) is always sent to all nodes. The other (the top) is run at one node. The larger plan is split into two fragments at the send/receive plan node pair.
So when there are two sets of send and receive plan nodes, we generate three fragments, and we don't currently have a way to actually execute three fragments. We assume one or two everywhere.
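To make that concrete: splitting your two-table plan below at each send/receive pair would give roughly these three fragments (the fragment labels are my annotation):

Fragment 1 (runs once, at the coordinator):
RETURN RESULTS TO STORED PROCEDURE
RECEIVE FROM ALL PARTITIONS

Fragment 2 (the middle fragment we have no way to execute):
SEND PARTITION RESULTS TO COORDINATOR
NESTLOOP INDEX JOIN
RECEIVE FROM ALL PARTITIONS

Fragment 3 (runs at every partition):
SEND PARTITION RESULTS TO COORDINATOR
SEQUENTIAL SCAN of "ASSET"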
The trick for joins of two partitioned tables on the partition key is that we don't actually need to generate two send/receive pairs. We can do the nestloop-index join of the two source tables in the bottom fragment at each partition, then simply aggregate the results.
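For your example query, the special-case plan would then contain a single send/receive pair, roughly:

RETURN RESULTS TO STORED PROCEDURE
RECEIVE FROM ALL PARTITIONS
SEND PARTITION RESULTS TO COORDINATOR
NESTLOOP INDEX JOIN
inline (INDEX SCAN of "OBJECT_DETAIL" using "SYS_IDX_SYS_PK_10020_10021" (unique-scan covering))
SEQUENTIAL SCAN of "ASSET"

That is, the inner receive/send pair disappears and the sequential scan feeds the join directly at each partition.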
So you don't have to change the Fragmentizer; instead, you need to teach the planner to recognize this special case and generate a different plan for it.
Thanks Mike. Keep the questions coming.
-John
>
> Thanks,
> Mike
>
>
On Feb 19, 2012, at 9:05 PM, Michael Alexeev wrote:
> Hi All,
>
> Consider the following select which joins two partitioned tables.
>
> SELECT A.T1_PK, A.T1_C, B.T2_PK FROM T1 A, T2 B WHERE A.T1_C = B.T2_PK;
>
> Both tables are partitioned on their respective PK - T1_PK for T1 and T2_PK for T2. The number of partitions is 2. The winner plan is
>
> RETURN RESULTS TO STORED PROCEDURE
> RECEIVE FROM ALL PARTITIONS
> SEND PARTITION RESULTS TO COORDINATOR
> NESTLOOP INDEX JOIN
> inline (INDEX SCAN of "OBJECT_DETAIL" using "SYS_IDX_SYS_PK_10020_10021" (unique-scan covering))
> RECEIVE FROM ALL PARTITIONS
> SEND PARTITION RESULTS TO COORDINATOR
> SEQUENTIAL SCAN of "ASSET"
>
> From this plan I would expect VoltDB to generate four fragments to collect the data from all possible partition combinations - T11+T21, T11+T22, T12+T21 and T12+T22 - plus two fragments to aggregate the results from the first two and from the third and fourth fragments, and a final one to aggregate everything together. In reality, there are only three fragments - two aggregate ones with dependencies and one without.
>
> Could someone explain to me the logic behind fragmentation and/or point me to the relevant code?
> Our code currently only works with two "fragments". One (the bottom) is always sent to all nodes. The other (the top) is run at one node. The larger plan is split into two fragments at the send/receive plan node pair.
> So when there are two sets of send and receive plan nodes, we generate three fragments, and we don't currently have a way to actually execute three fragments. We assume one or two everywhere.
> The trick for joins of two partitioned tables on the partition key is that we don't actually need to generate two send/receive pairs. We can do the nestloop-index join of the two source tables in the bottom fragment at each partition, then simply aggregate the results.
> So you don't have to change the Fragmentizer; instead, you need to teach the planner to recognize this special case and generate a different plan for it.
Hi John, thanks for the clarification.
Below is an example of the plan for a three-table join:
RETURN RESULTS TO STORED PROCEDURE
RECEIVE FROM ALL PARTITIONS
SEND PARTITION RESULTS TO COORDINATOR
NEST LOOP JOIN
SEQUENTIAL SCAN of "T3"
RECEIVE FROM ALL PARTITIONS
SEND PARTITION RESULTS TO COORDINATOR
NESTLOOP INDEX JOIN
inline (INDEX SCAN of "OBJECT_DETAIL" using "SYS_IDX_SYS_PK_10020_10021" (unique-scan covering))
RECEIVE FROM ALL PARTITIONS
SEND PARTITION RESULTS TO COORDINATOR
SEQUENTIAL SCAN of "ASSET"
It looks like for a multi-partition join we always have a Receive/Send node pair followed by one of the nested-loop join nodes. If my observation is correct, we could modify the Fragmentizer to bypass every Receive/Send pair other than the first one, provided the pair is followed by a nested-loop join node and both partition keys are part of the join's WHERE clause. At that point AbstractJoinPlanNode.m_predicate should already be fully constructed, so I assume it would be possible to parse it to extract the involved columns. Once the set of columns for each table is identified, and each table's partition key is in the corresponding set, the Fragmentizer can safely skip the receive/send pair; if not, it proceeds with the current logic.

The only drawback I see so far is that CompiledPlan.fullWinnerPlan and a few other members of the final CompiledPlan would be out of sync because of the extra receive/send nodes, but it looks like they are not used for execution anyway.
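To make the idea concrete, here is a rough sketch of the check I have in mind. The classes below are simplified stand-ins for the real expression tree (AbstractExpression / TupleValueExpression), and all the names here are hypothetical:

import java.util.*;

public class PartitionJoinCheck {

    // Simplified stand-in for VoltDB's TupleValueExpression: a column reference.
    static class Tve {
        final String table;
        final String column;
        Tve(String table, String column) { this.table = table; this.column = column; }
    }

    // Simplified stand-in for the expression tree that m_predicate holds.
    static class Expr {
        Expr left, right;   // child expressions, if any
        Tve tve;            // non-null when this node is a column reference
    }

    // Walk the predicate tree and collect every column reference.
    static void collectTves(Expr e, List<Tve> out) {
        if (e == null) return;
        if (e.tve != null) out.add(e.tve);
        collectTves(e.left, out);
        collectTves(e.right, out);
    }

    // True if every joined table's partition column appears in the join
    // predicate, i.e. the receive/send pair can be skipped. (A complete
    // check would also have to verify that the partition columns are
    // equated to each other, not merely mentioned somewhere.)
    static boolean canSkipSendReceive(Expr joinPredicate,
                                      Map<String, String> partitionColumnByTable) {
        List<Tve> tves = new ArrayList<Tve>();
        collectTves(joinPredicate, tves);

        // Group the referenced columns by table.
        Map<String, Set<String>> columnsByTable = new HashMap<String, Set<String>>();
        for (Tve tve : tves) {
            Set<String> cols = columnsByTable.get(tve.table);
            if (cols == null) {
                cols = new HashSet<String>();
                columnsByTable.put(tve.table, cols);
            }
            cols.add(tve.column);
        }

        // Each table's partition key must be among the join columns.
        for (Map.Entry<String, String> entry : partitionColumnByTable.entrySet()) {
            Set<String> cols = columnsByTable.get(entry.getKey());
            if (cols == null || !cols.contains(entry.getValue())) {
                return false;
            }
        }
        return true;
    }
}

The partitionColumnByTable map stands in for whatever catalog lookup ends up providing the partition column for each table, which ties into my question below.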
Does it seem reasonable to you?
One thing I haven't been able to find out yet is how to get the partition information for a given table. The TVE type contains only the table name.
Mike