Number of Fragments for the Select Statement


Michael Alexeev

Feb 19, 2012, 9:05:08 PM
to voltd...@googlegroups.com
Hi All,

Consider the following select which joins two partitioned tables.

SELECT  A.T1_PK,  A.T1_C,  B.T2_PK FROM   T1 A,  T2 B WHERE   A.T1_C = B.T2_PK;

Both tables are partitioned on their respective PK - T1_PK for T1 and T2_PK for T2. The number of partitions is 2. The winning plan is:

RETURN RESULTS TO STORED PROCEDURE
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
   NESTLOOP INDEX JOIN
   inline (INDEX SCAN of "OBJECT_DETAIL" using "SYS_IDX_SYS_PK_10020_10021" (unique-scan covering))
    RECEIVE FROM ALL PARTITIONS
     SEND PARTITION RESULTS TO COORDINATOR
      SEQUENTIAL SCAN of "ASSET"

From this plan I would expect VoltDB to generate four fragments to collect the data from all possible partition combinations - T11+T21, T11+T22, T12+T21 and T12+T22 - plus two fragments to aggregate the results from the first two and from the third and fourth fragments, and a final one to aggregate everything together. In reality, there are only three fragments - two aggregate ones with dependencies and one without.

Could someone explain the logic behind fragmentation to me and/or point me to the relevant code?

Thanks,
Mike


John Hugg

Feb 19, 2012, 10:11:29 PM
to voltd...@googlegroups.com

org.voltdb.planner.Fragmentizer

Our code currently only works with two "fragments". One (the bottom) is always sent to all nodes. The other (the top) is run at one node. The larger plan is split into two fragments at the send and receive plan node.

So when there are two sets of send and receive plan nodes, we generate three fragments, and we don't currently have a way to actually execute three fragments. We assume one or two everywhere.
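To make the two-fragment model concrete, here is a minimal, self-contained Java sketch of the split-at-receive idea. It is not the actual org.voltdb.planner.Fragmentizer code - the PlanNode, Fragments and FragmentizerSketch names are illustrative only.

// Minimal sketch only -- NOT the real org.voltdb.planner.Fragmentizer.
import java.util.ArrayList;
import java.util.List;

class PlanNode {
    final String type;                        // e.g. "RECEIVE", "SEND", "SEQSCAN"
    final List<PlanNode> children = new ArrayList<>();
    PlanNode(String type) { this.type = type; }
    PlanNode add(PlanNode child) { children.add(child); return this; }
}

class Fragments {
    PlanNode top;      // runs once, at the coordinating site
    PlanNode bottom;   // shipped to every partition
}

class FragmentizerSketch {
    // Split the plan at the first RECEIVE node found: the subtree under its
    // SEND child becomes the bottom fragment, and the top fragment ends at
    // the RECEIVE. Only one split point is handled, hence "one or two
    // fragments" everywhere.
    static Fragments splitAtReceive(PlanNode root) {
        Fragments f = new Fragments();
        f.top = root;
        cut(root, f);
        return f;
    }

    private static void cut(PlanNode node, Fragments f) {
        for (PlanNode child : node.children) {
            if (child.type.equals("RECEIVE")) {
                PlanNode send = child.children.get(0);   // SEND sits under RECEIVE
                f.bottom = send.children.get(0);         // per-partition plan
                child.children.clear();                  // detach bottom from top
                return;
            }
            cut(child, f);
            if (f.bottom != null) return;                // stop after the first split point
        }
    }
}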

The trick for joins of two partitioned tables on the partition key is that we don't actually need to generate two send/receive pairs. We can do the nestloop-index join of the two source tables in the bottom fragment at each partition, then simply aggregate the results.

So you don't have to change the Fragmentizer; instead, you need to teach the planner to recognize this special case and generate a different plan for it.
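A rough sketch of the special-case check the planner would need, assuming it already knows each table's partition column and which columns the join equates. The JoinSide and isPartitionKeyJoin names are hypothetical, not existing VoltDB planner APIs.

// Hypothetical helper -- not an existing VoltDB planner API.
class JoinSide {
    String table;
    boolean isPartitioned;
    String partitionColumn;   // null for replicated tables
    String joinColumn;        // column this side contributes to the equality predicate
}

class PartitionKeyJoinCheck {
    // If both sides are partitioned and each side joins on its own partition
    // column, matching rows are guaranteed to live in the same partition, so
    // the inner send/receive pair can be dropped and the join pushed into the
    // bottom fragment.
    static boolean isPartitionKeyJoin(JoinSide outer, JoinSide inner) {
        return outer.isPartitioned && inner.isPartitioned
            && outer.partitionColumn != null
            && inner.partitionColumn != null
            && outer.partitionColumn.equalsIgnoreCase(outer.joinColumn)
            && inner.partitionColumn.equalsIgnoreCase(inner.joinColumn);
    }
}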

Thanks Mike. Keep the questions coming.

-John


Michael Alexeev

Feb 19, 2012, 11:08:49 PM
to voltd...@googlegroups.com
Hi John,

On Sun, Feb 19, 2012 at 10:11 PM, John Hugg <jh...@voltdb.com> wrote:

> org.voltdb.planner.Fragmentizer
>
> Our code currently only works with two "fragments". One (the bottom) is always sent to all nodes. The other (the top) is run at one node. The larger plan is split into two fragments at the send and receive plan node.
>
> So when there are two sets of send and receive plan nodes, we generate three fragments, and we don't currently have a way to actually execute three fragments. We assume one or two everywhere.

Yes, in the case of three fragments the second top fragment's dependency overrides the first one's inside VoltProcedure::slowPath, thus leading to a missing dependency.


> The trick for joins of two partitioned tables on the partition key is that we don't actually need to generate two send/receive pairs. We can do the nestloop-index join of the two source tables in the bottom fragment at each partition, then simply aggregate the results.
>
> So you don't have to change the Fragmentizer; instead, you need to teach the planner to recognize this special case and generate a different plan for it.

It doesn't have to be a join on the partition key. If both tables are not partitioned on the joining column, the plan still has two receive/send pairs:

RETURN RESULTS TO STORED PROCEDURE
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
   NEST LOOP JOIN
    SEQUENTIAL SCAN of "T1"

    RECEIVE FROM ALL PARTITIONS
     SEND PARTITION RESULTS TO COORDINATOR
      SEQUENTIAL SCAN of "T2"

I guess the special case is the join of two partitioned tables, period, right?
The 'correct' plan should then look like:


RETURN RESULTS TO STORED PROCEDURE
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
   NEST LOOP JOIN
    SEQUENTIAL SCAN of "T1"
      SEQUENTIAL SCAN of "T2"

My first naive approach was to modify VoltProcedure::slowPath to be able to keep track of more than one top fragment per statement, but it didn't go far.

Thanks for pointing me in the right direction.
Mike
 

John Hugg

Feb 19, 2012, 11:25:37 PM
to voltd...@googlegroups.com
Right. We don't really plan to support joining two partitioned tables on non-partition-key columns anytime soon. That would likely involve a very large amount of data movement, unless we could determine that any predicates would sufficiently filter the data.

So the special case is joining on the partition key, because that doesn't involve moving data around. Your 'correct' plan looks right.

-John

Michael Alexeev

Feb 20, 2012, 7:30:51 PM
to voltd...@googlegroups.com

Agreed - if both tables are joined on the columns they are partitioned on, then no data movement across sites is involved. The final result is a simple aggregation of the joins of the respective partitions from all sites.

My only question is that in the original example attached to ENG-490, one of the tables is joined on a non-partition-key column.

SELECT  A.ASSET_ID,  A.OBJECT_DETAIL_ID,  OD.OBJECT_DETAIL_ID
FROM   ASSET A,  OBJECT_DETAIL OD
WHERE   A.OBJECT_DETAIL_ID = OD.OBJECT_DETAIL_ID;

From the project.xml, the ASSET table is partitioned on ASSET_ID and OBJECT_DETAIL on OBJECT_DETAIL_ID.

In this case there is still a need for a cross-site join, and it won't be supported. Is that correct?

Mike


John Hugg

Feb 21, 2012, 9:24:07 AM
to voltd...@googlegroups.com
The example might be wrong. Joining two replicated tables on the partitioned key is the goal.

John Hugg

Feb 21, 2012, 9:36:04 AM
to John Hugg, voltd...@googlegroups.com
Oh darn. I meant "partitioned tables".

Michael Alexeev

Feb 21, 2012, 7:15:45 PM
to voltd...@googlegroups.com

Hi John, thanks for the clarification.

Below is an example of the plan for a three-table join:

RETURN RESULTS TO STORED PROCEDURE
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
   NEST LOOP JOIN
    SEQUENTIAL SCAN of "T3"
     RECEIVE FROM ALL PARTITIONS
     SEND PARTITION RESULTS TO COORDINATOR
      NESTLOOP INDEX JOIN
      inline (INDEX SCAN of "OBJECT_DETAIL" using "SYS_IDX_SYS_PK_10020_10021" (unique-scan covering))
       RECEIVE FROM ALL PARTITIONS
        SEND PARTITION RESULTS TO COORDINATOR
         SEQUENTIAL SCAN of "ASSET"

It looks like for a multi-partition join we always have a Receive/Send node pair followed by one of the nested loop nodes. If my observation is correct, then we can modify the Fragmentizer to bypass all Receive/Send pairs other than the first one if they are followed by a nested loop node and both partition keys are part of the join WHERE clause. At this point, AbstractJoinPlanNode.m_predicate should already be fully constructed, so I assume it would be possible to parse it to extract the involved columns. Once the set of columns for each table is identified, and the partition key for each table is in the corresponding set, the Fragmentizer can safely skip the receive/send pair; if not, it proceeds with the current logic. The only drawback I see so far is that CompiledPlan.fullWinnerPlan and a few other members of the final CompiledPlan would be out of sync because of the extra receive/send nodes, but it doesn't look like they are used for execution anyway.
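As a rough, hedged sketch of that predicate walk, the following stands in for VoltDB's expression tree (AbstractExpression / TupleValueExpression) with simplified classes; none of these names are the real planner API, only the shape of the idea.

// Rough sketch only -- the Expr classes are illustrative stand-ins for the
// VoltDB expression tree, not the real planner classes.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

abstract class Expr { Expr left, right; }                 // binary expression tree

class ColumnRef extends Expr {                            // stand-in for a TVE
    final String table, column;
    ColumnRef(String t, String c) { table = t; column = c; }
}

class PredicateColumns {
    // Collect every table -> set of columns referenced anywhere in the predicate.
    static Map<String, Set<String>> collect(Expr predicate) {
        Map<String, Set<String>> cols = new HashMap<>();
        walk(predicate, cols);
        return cols;
    }

    private static void walk(Expr e, Map<String, Set<String>> cols) {
        if (e == null) return;
        if (e instanceof ColumnRef) {
            ColumnRef c = (ColumnRef) e;
            cols.computeIfAbsent(c.table, k -> new HashSet<>()).add(c.column);
        }
        walk(e.left, cols);
        walk(e.right, cols);
    }

    // The bypass is considered safe only if every joined table's partition
    // column appears in the predicate (partitionKeys: table -> partition column).
    static boolean canSkipSendReceive(Map<String, Set<String>> predicateCols,
                                      Map<String, String> partitionKeys) {
        for (Map.Entry<String, String> e : partitionKeys.entrySet()) {
            Set<String> referenced = predicateCols.get(e.getKey());
            if (referenced == null || !referenced.contains(e.getValue())) return false;
        }
        return true;
    }
}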

Does it seem reasonable to you?

One thing I wasn't able to figure out yet is how to get the partition information for a given table. The TVE type contains only the table name.

Mike  

Michael Alexeev

Feb 23, 2012, 9:12:06 PM
to voltd...@googlegroups.com
Hi John,

I actually did change the Fragmentizer not to split send/receive pairs but simply bypass them in the case of a join on a partition key, if the total number of fragments is greater than 1. The first send/receive pair always gets split. It seems to work fine. I still need to figure out how to get to the table's definition by its name in order to get the partition column name, but I don't think it will be a big problem.
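For the table-definition lookup, one plausible route is the compiled catalog, assuming a org.voltdb.catalog.Database handle is available from the planner. The accessor names below (getTables(), getIgnoreCase(), getIsreplicated(), getPartitioncolumn(), getTypeName()) are a best guess at the generated catalog classes and should be checked against the source.

// Hedged guess -- verify the catalog accessors before relying on them.
import org.voltdb.catalog.Column;
import org.voltdb.catalog.Database;
import org.voltdb.catalog.Table;

class PartitionColumnLookup {
    // Returns the partition column name for a partitioned table,
    // or null for replicated (or unknown) tables.
    static String partitionColumnOf(Database db, String tableName) {
        Table t = db.getTables().getIgnoreCase(tableName);
        if (t == null || t.getIsreplicated()) return null;
        Column c = t.getPartitioncolumn();
        return (c == null) ? null : c.getTypeName();   // catalog items expose their name via getTypeName()
    }
}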

Do you still think that the planner approach is the way to go?

Mike

John Hugg

Feb 23, 2012, 9:14:26 PM
to voltd...@googlegroups.com
If it works, it's probably reasonable. Send a patch and I'll take a look. I think I'm still one behind, but I'm holding out for us to merge the "voltcore-integration" branch to master, which should happen soon.

-John