ENG-2503/2504


Michael Alexeev

Mar 8, 2012, 9:09:55 PM
to voltd...@googlegroups.com
Hi Ariel,

After reading the bug comments I am not quite sure I understand why you think it's an error. Could you please elaborate? All I can see is that the allocation keeps growing while iterating over the table until it reaches the limit. No calls to decrease memory happen until after the exception is thrown. Are you saying that certain memory could be released earlier to avoid hitting the limit, or is there something else?

Thanks,
Mike

Ariel Weisberg

Mar 8, 2012, 9:50:08 PM
to voltd...@googlegroups.com
Hi Michael,

Awesome work on the partitioned to partitioned join!

The cardinality of lgn_names is 10, so in the case of a group by or distinct there should only be 10 result rows returned to the coordinator by each partition. The two plans differ slightly: in 2504 the sequential scan + aggregate node should not generate more than 10 hash buckets in the aggregate executor, since it is grouping on lgn_name with a cardinality of 10. The aggregate is already being correctly pushed down by the planner to each partition.

In 2503 the issue is that the distinct is not pushed down to each partition like the aggregates are, so that plan is being emitted from the planner incorrectly.

Keep in mind that the sequential scan executor doesn't generate a temp table with output tuples. The sequential scan executor returns as its output table the actual persistent table so it is iterated directly. The index scan executor generates an output table containing pointers to rows so it can easily hit the temp space limits when feeding an aggregate executor because it materializes the entire set of tuple pointers before the aggregate executor runs.
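
(For illustration only -- a tiny C++ sketch of that difference, using made-up stand-ins rather than the real EE Table and executor classes: the seq scan just aliases the persistent table, while the index scan materializes a pointer per matching row into temp space before its parent ever runs.)

// Sketch only: hypothetical stand-ins, not the real VoltDB table classes.
#include <cstddef>
#include <vector>

struct Row { int fields[4]; };

using PersistentTable = std::vector<Row>;
using PointerTempTable = std::vector<const Row*>;

// Sequential scan with no predicate/projection: the parent executor
// iterates the persistent table directly, so no temp space is used.
const PersistentTable& seqScanOutput(const PersistentTable& target) {
    return target;                         // alias, O(1) extra memory
}

// Index scan: every matching row is materialized as a pointer in a temp
// table *before* the parent (e.g. an aggregate) runs, so temp space grows
// with the match count and can hit the temp table limit.
PointerTempTable indexScanOutput(const PersistentTable& target,
                                 std::size_t firstMatch, std::size_t lastMatch) {
    PointerTempTable out;
    for (std::size_t i = firstMatch; i < lastMatch && i < target.size(); ++i)
        out.push_back(&target[i]);
    return out;
}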

I kind of wish we just did row at a time executors :-(

Thanks,
Ariel

Ryan Betts

Mar 8, 2012, 9:51:29 PM
to voltd...@googlegroups.com
On Thu, Mar 8, 2012 at 9:09 PM, Michael Alexeev
<michael...@gmail.com> wrote:


(I see Ariel answered this while I was typing ... so apologies to
Ariel for re-answering. Perhaps two comments are better than one?)

Given an arbitrary number of rows, but only 10 unique values, the most
memory the distinct should require is 10 * (size of projected tuple).

The problem is that the distinct is on top of the receive. This plan:

RETURN RESULTS TO STORED PROCEDURE
 DISTINCT
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
   SEQUENTIAL SCAN of "LOG"


Should really look like this:


RETURN RESULTS TO STORED PROCEDURE
 DISTINCT
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
    DISTINCT
     SEQUENTIAL SCAN of "LOG"

And the distinct should be inlined with the scan. We do this push down
for other aggregates but a simple distinct got missed...

Ryan.

Michael Alexeev

Mar 13, 2012, 9:08:46 PM
to voltd...@googlegroups.com
Ariel, Ryan,



Thanks for the very thorough explanation. I think I got the general idea, but a few clouds still remain...
I don't think that a simple push-down of the distinct node would solve the whole problem, because even for a replicated table with the following access plan


RETURN RESULTS TO STORED PROCEDURE
 DISTINCT
  SEQUENTIAL SCAN of "LOG"

I am still hitting the memory limits. Anyway, I will take a look at how it is handled for other aggregates to get a better idea.

Mike

Ryan Betts

Mar 13, 2012, 9:43:28 PM
to voltd...@googlegroups.com
> Thanks for the very thorough  explanation. I think I got the general idea
> but still few clouds remain...
> I don't think that simple push down of the distinct node would solve the
> whole problem because even for a replicated table with the following access
> plan
>
>
> RETURN RESULTS TO STORED PROCEDURE
>  DISTINCT
>   SEQUENTIAL SCAN of "LOG"
>
> I am still hitting the memory limits. Anyway, I will take a look at how  it
> is handled for other aggregates to get a better idea.


If the temp table limit is hit, it means the table is being copied as
a result of the scan; that copy isn't necessary. The EE glues
executors together as source -> output to temp table -> next executor
... As an optimization, if the source doesn't alter tuples, it can
eliminate the temp table and create the chain source -> next executor.

See this code, for example, in seqscanexecutor.cpp:

//
// OPTIMIZATION: If there is no predicate for this SeqScan,
// then we want to just set our OutputTable pointer to be the
// pointer of our TargetTable. This prevents us from just
// reading through the entire TargetTable and copying all of
// the tuples. We are guarenteed that no Executor will ever
// modify an input table, so this operation is safe
//
if (!this->needsOutputTableClear())
{
    node->setOutputTable(node->getTargetTable());
}

The temp table limit is configurable, btw. (See
https://community.voltdb.com/docs/MgtGuide/HostConfigDBOpts). Or,
with a little hacking, you can reduce the limit to something very
small and not need to deal with loading a large dataset.

On another note, John enabled the github wiki:
https://github.com/VoltDB/voltdb/wiki. We intend to fill this out with
some more developer focused docs.

Ryan.

Michael Alexeev

Mar 15, 2012, 7:26:16 PM
to voltd...@googlegroups.com
Ryan,

On Tue, Mar 13, 2012 at 9:43 PM, Ryan Betts <rbe...@voltdb.com> wrote:
 
The temp table limit is configurable, btw.  (See
https://community.voltdb.com/docs/MgtGuide/HostConfigDBOpts).  Or,
with a little hacking, you can reduce the limit to something very
small and not need to deal with loading a large dataset.

Thanks for the tip. The size of the temp table in the project.xml is in MB, right? 

On another note, John enabled the github wiki:
https://github.com/VoltDB/voltdb/wiki. We intend to fill this out with
some more developer focused docs.


Looking at the content, it's great! Thank you very much! 

Mike 

Michael Alexeev

Mar 20, 2012, 9:27:25 PM
to voltd...@googlegroups.com
Ryan,


Should really look like this:


RETURN RESULTS TO STORED PROCEDURE
 DISTINCT
 RECEIVE FROM ALL PARTITIONS
  SEND PARTITION RESULTS TO COORDINATOR
    DISTINCT
     SEQUENTIAL SCAN of "LOG"

And the distinct should be inlined with the scan.

I modified the assembler to push down the distinct node, but have a question about making it inline.
Does it mean that there should actually be two distinct nodes: the first one the parent of the scan node, and another (inlined) one part of the scan node? That seems to be the case with the projection node. Could you please clarify?

Mike

Ryan Betts

Mar 20, 2012, 10:03:57 PM
to voltd...@googlegroups.com
>> RETURN RESULTS TO STORED PROCEDURE
>>  DISTINCT

The above DISTINCT will re-distinct the results from each partition.

>>  RECEIVE FROM ALL PARTITIONS
>>   SEND PARTITION RESULTS TO COORDINATOR
>>     DISTINCT

This DISTINCT will produce a unique set from each partition.

> I modified assembler to push down distinct node but have a question about
> making it inline.
> Does it mean that there actually should be two distinct nodes: The first one
> is the parent of the scan node and another one (inlined) is part of the scan
> node? That seems to be the case with the projection node. Could you please
> clarify.

One distinct is distributed and run at each partition. The other is run on
the results collected from each partition in case the partitions
return duplicates.
(If they don't return duplicates, it will be equivalent to a UNION).

Ryan.

Michael Alexeev

Mar 20, 2012, 11:34:10 PM
to voltd...@googlegroups.com
Ryan,

Yes, the above makes perfect sense and this is how it's implemented.
 The final node tree for the whole plan currently looks like the following

Send
   Projection
       Distinct
           Receive
               Send
                   Projection - new
                         Distinct - new
                              Scan
                                   Inlined Projection

My question was about inlining the lower distinct node - the distributed one. I am still not 100% clear what advantages an inline node provides and how it is implemented, which is why I am having all these questions :). The only other example of an inline node I was able to find is the projection node.
The scan node has an inlined projection node, and at the same time there is one more projection node higher up in the tree with exactly the same schema. Why does it need to be in two places? If distinct needs to be inlined as well, should it be
 Send
        ....              
             Send
                              Scan
                                   Inlined Projection
                                   Inlined Distinct

 or
        .....
             Send
                   Projection - new
                         Distinct - new
                              Scan
                                   Inlined Projection
                                   Inlined Distinct
or something else?

It looks like executors are called in a sequence determined by node position in the tree, and the purpose of the inline node is to tie input/output tables together to avoid an extra copy. If this is the case, there should be a distinct node on top of the scan.

Sorry if it doesn't make much sense :)
Mike

Paul Martel

Mar 21, 2012, 6:48:10 AM
to voltd...@googlegroups.com
Mike,

In theory, there should only need to be one Projection in the plan. I would expect it to be inlined with the Scan.

A simple alternative implementation of Distinct that works well with low cardinality columns is to make its execution identical to
a HashAggregate that had all the selected columns as grouping columns and had no aggregate function columns.
You could go so far as to replace the Distinct plan node with a HashAggregate node, and that's actually what I'd advise rather than re-inventing that wheel. A useful experiment would be to examine the plan node tree for a variation of the original DISTINCT query that replaced the DISTINCT with a GROUP BY on all selected columns.
The query result and the PlanNode tree for that case should be exactly what you are aiming for, including its placement of Projection nodes. Any additional overhead the HashAggregate might impose in this special case should be identified and addressed as performance bugs. I don't expect they'd have anywhere near the impact of the performance bug you are now addressing.
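
(A rough sketch of that equivalence, with invented types rather than the actual executor classes: a hash aggregate that groups on every selected column and computes no aggregate functions emits exactly the distinct rows.)

// Sketch only: DISTINCT as a hash aggregate whose grouping key is the full
// set of selected columns and whose aggregate-function list is empty.
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

using Row = std::vector<std::string>;      // the projected (selected) columns

struct RowHash {
    std::size_t operator()(const Row& r) const {
        std::size_t h = 0;
        for (const std::string& v : r)
            h = h * 131 + std::hash<std::string>{}(v);
        return h;
    }
};

// "SELECT DISTINCT c1, ..., cn" behaves like "SELECT c1, ..., cn GROUP BY
// c1, ..., cn": one output row per distinct grouping key, and with no
// aggregate functions the key *is* the output row.
std::vector<Row> hashAggregateDistinct(const std::vector<Row>& input) {
    std::unordered_set<Row, RowHash> groups;
    std::vector<Row> output;
    for (const Row& r : input)
        if (groups.insert(r).second)       // first time this key is seen
            output.push_back(r);
    return output;
}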

As I mentioned earlier, there are additional optimizations possible for high-cardinality cases. But those optimizations apply equally to high-cardinality GROUP BY queries, all the more reason to think we are better off in general without a separate Distinct node.

--paul

mike alexeev

Mar 21, 2012, 1:21:17 PM
to voltd...@googlegroups.com
Paul,

I understand what you are saying and will experiment with the DISTINCT-as-GROUP-BY
plan, but ... I am at a loss as to how it will solve the original problem
from the 2503/2504 issues.

The execution plan corresponding to a simple "select distinct C from T" (T is
replicated to simplify the plan) is

Send
 Projection
  Distinct
   Sequential Scan
    inline Projection

The out-of-memory exception happens within the seqscan executor for the simple
reason that the optimization chaining the target and output tables for this node
is off: SeqScanPlanNode::needsOutputTableClear returns true because of the
inlined projection node. In that scenario SeqScanExecutor::p_execute iterates
over the whole target table, inserting a new tuple into the output table for
each row and eventually exceeding the memory limit. The distinct executor (and
I suspect the aggregate one as well) is invoked only after the sequential one
finishes, which is too late in this case.

One way to get around it is to change the plan to be

Send
 Projection
  Sequential Scan
   inline Projection
   inline Distinct

and call the distinct expression from within SeqScanExecutor::p_execute as one
more optimization, to reduce the number of tuples inserted into the scan's
output table. Does that seem reasonable? (Something along the lines of the sketch below.)
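
// Sketch only: a scan loop that consults an inlined "distinct" check before
// copying a tuple into the scan's output table. Names are made up and stand
// in for the real executor and table classes; the real p_execute also has to
// handle the predicate and the inline projection.
#include <set>
#include <vector>

using Tuple = std::vector<int>;

struct InlineDistinct {
    std::set<Tuple> seen;                            // values seen so far
    // Mirrors the per-row body of DistinctExecutor::p_execute: true only
    // the first time a given value appears.
    bool isTupleValueDistinct(const Tuple& t) { return seen.insert(t).second; }
};

void seqScanWithInlineDistinct(const std::vector<Tuple>& targetTable,
                               InlineDistinct* inlineDistinct,   // may be null
                               std::vector<Tuple>& outputTable) {
    for (const Tuple& t : targetTable) {
        // ...predicate and inline projection would be applied here...
        if (inlineDistinct && !inlineDistinct->isTupleValueDistinct(t))
            continue;                                // duplicate, never copied
        outputTable.push_back(t);                    // output stays small
    }
}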

Mike

Paul Martel

Mar 21, 2012, 3:23:43 PM
to voltd...@googlegroups.com
Mike,

I see what you mean about the copying behavior of SeqScanExecutor::p_execute.
So, HashAggregate suffers from the same issue as Distinct.
They BOTH want to be inlined.

Off the top of my head, I think the way to achieve this for Distinct, in broad strokes, MAY be to:

1. factor out the behavior of the (per input row) loop body in DistinctExecutor::p_execute -- call it DistinctExecutor::exec_one
2. annotate the SeqScanExecutor with an "inlined" DistinctExecutor object.
3. pass that annotation to the TempTable factory causing it to create a new kind of (output) TempTable -- call it DistinctTempTable.
4. implement DistinctTempTable::insertTuple to call DistinctExecutor::exec_one, populating a smaller underlying (output) TempTable.
5. implement DistinctTempTable::iterator to return an iterator over that underlying TempTable.
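
(In effect, steps 3-5 describe a decorating output table; very roughly, and with invented names:)

// Sketch only: a "smart" temp table that drops duplicates on insert and
// delegates everything else to an underlying temp table (invented names).
#include <set>
#include <vector>

using Tuple = std::vector<int>;
using TempTable = std::vector<Tuple>;

class DistinctTempTable {
public:
    // Step 4: insertTuple runs the per-row distinct check (exec_one) and
    // only populates the smaller underlying table for first occurrences.
    bool insertTuple(const Tuple& t) {
        if (!seen_.insert(t).second) return false;   // duplicate, swallowed
        underlying_.push_back(t);
        return true;
    }
    // Step 5: iteration is just iteration over the underlying table.
    TempTable::const_iterator begin() const { return underlying_.begin(); }
    TempTable::const_iterator end()   const { return underlying_.end(); }

private:
    std::set<Tuple> seen_;      // the exec_one state
    TempTable underlying_;      // the deduplicated output
};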

Does that make sense?

I'm not sure how best to draw it as a chain of nested Plan Nodes
 -- the way you show the inline Projection and inline Distinct as same-level children of the Sequential Scan seems to get the idea across OK,
as long as it is understood that the mechanism by which the inline nodes feed and are fed tuples is a little different than the usual bottom-up protocol.

--paul

Michael Alexeev

Mar 21, 2012, 9:40:37 PM
to voltd...@googlegroups.com
Paul,

Yes, it does.

Off the top of my head, I think the way to achieve this for Distinct, in broad strokes, MAY be to:

1. factor out the behavior of the (per input row) loop body in DistinctExecutor::p_execute -- call it DistinctExecutor::exec_one
2. annotate the SeqScanExecutor with an "inlined" DistinctExecutor object.

Every node already has a map to hold its inline nodes. The key is the node type. I don't think anything special needs to be done there.
 
3. pass that annotation to the TempTable factory causing it to create a new kind of (output) TempTable -- call it DistinctTempTable.
4. implement DistinctTempTable::insertTuple to call DistinctExecutor::exec_one, populating a smaller underlying (output) TempTable.
5. implement DistinctTempTable::iterator to return an iterator over that underlying TempTable.

Why can't we simply reuse the existing output table of the scan node, or use the DistinctExecutor::p_init logic for that purpose to set the output table for the scan node?

I was also thinking of calling the DistinctPlanNode::getDistinctExpression::eval chain directly from the SeqScanExecutor::p_execute loop over the table, similar to the projection node, and based on the outcome either insert the tuple or continue.

I'm not sure how best to draw it as a chain of nested Plan Nodes
 -- the way you show the inline Projection and inline Distinct as same-level children of the Sequential Scan seems to get the idea across OK,
as long as it is understood that the mechanism by which the inline nodes feed and are fed tuples is a little different than the usual bottom-up protocol.

Yes, ordering is a problem. So far, the only inline node is the projection one. Distinct and HashAggregate are mutually exclusive, so the execution order can simply be hard-coded: Projection first, then either Distinct or HashAggregate. Of course, if the variety of inline nodes grows, a more generic solution will be required.

Mike

Paul Martel

Mar 22, 2012, 12:41:17 AM
to voltd...@googlegroups.com
On Wed, Mar 21, 2012 at 9:40 PM, Michael Alexeev <michael...@gmail.com> wrote:
Paul,

Yes, it does.

Off the top of my head, I think the way to achieve this for Distinct, in broad strokes, MAY be to:

1. factor out the behavior of the (per input row) loop body in DistinctExecutor::p_execute -- call it DistinctExecutor::exec_one
2. annotate the SeqScanExecutor with an "inlined" DistinctExecutor object.

Every node already has a map to hold its inline nodes. The key is the node type. I don't think anything special needs to be done there.

Good catch.
 
 
3. pass that annotation to the TempTable factory causing it to create a new kind of (output) TempTable -- call it DistinctTempTable.
4. implement DistinctTempTable::insertTuple to call DistinctExecutor::exec_one, populating a smaller underlying (output) TempTable.
5. implement DistinctTempTable::iterator to return an iterator over that underlying TempTable.

Why can't we simply reuse the existing output table of the scan node, or use the DistinctExecutor::p_init logic for that purpose to set the output table for the scan node?

Yes, something like the existing output table still needs to be constructed (what I refer to as the underlying temptable).
I don't know off-hand which class should take responsibility for that.
 

I was also thinking of calling DistinctPlanNode::getDistinctExpression::eval chain directly from the SeqScanPlanNode::p_execute  loop over the table similar to the projection node and based on the outcome either insert the tuple or continue.

I'm not sure how best to draw it as a chain of nested Plan Nodes
 -- the way you show the inline Projection and inline Distinct as same-level children of the Sequential Scan seems to get the idea across OK,
as long as it is understood that the mechanism by which the inline nodes feed and are fed tuples is a little different than the usual bottom-up protocol.

Yes, ordering is a problem. So far, the only inline node is the projection one. Distinct and HashAggregate are mutually exclusive, so the execution order can simply be hard-coded: Projection first, then either Distinct or HashAggregate. Of course, if the variety of inline nodes grows, a more generic solution will be required.

If I understand you, you are suggesting that we wait for more case(s) to tip us over to refactoring for a general solution.
I was thinking that we had already arrived at that point.
I (the typical eager new hire) am thinking that we MIGHT want to remodel inline Projection and/or Predicate filtering
through this same universal "smart output table" approach.
That would give us the critical mass to justify adopting a more generic solution earlier.
That's a judgement call, whether it's less risky to be strategic and pay now or to be tactical and pay later.
Both approaches have their benefits.

mike alexeev

Mar 22, 2012, 10:10:36 AM
to voltd...@googlegroups.com
I will take the more pragmatic approach (hard-code) for now just to make sure the whole idea works. It shouldn't take long to verify.
 
To implement the generic solution, the inline node ordering needs to be established. The first thing that comes to mind is the reverse of the order in which nodes are inlined, but I am not sure whether it would work in all cases. Any suggestions there?

Paul Martel

Mar 23, 2012, 11:26:36 AM
to voltd...@googlegroups.com
The pragmatic approach should work -- by definition?

As for the generic solution, I'm not sure why the reversed order would always make sense, but some simple rule like that seems in order. I suspect that reverse order may work short-term as an accidental effect of the order in which various inline optimizations are considered. In the long term, this aspect of the optimizer may need to be reworked to force it to follow a convention of inlining nodes in a reliable order to drive execution. Any complex logic along these lines seems better placed in the planner/optimizer, leaving the executor to iterate over a simple sequence of inlined nodes and apply them in fixed order.

Michael Alexeev

Mar 23, 2012, 5:57:42 PM
to voltd...@googlegroups.com

Hi Paul,


On Fri, Mar 23, 2012 at 11:26 AM, Paul Martel <pma...@voltdb.com> wrote:
The pragmatic approach should work -- by definition?

No, not at all. I was simply not very comfortable making drastic changes :)

But anyway, it looks like you were right about critical mass. Not only is the SeqScan executor affected, but IndexScan, NestLoop, and NestLoopIndex are as well. Consider the following select:

Select distinct t1.c1 from t1, t2 where t1.pk1 = t2.pk2;


The current access plan is


 RETURN RESULTS TO STORED PROCEDURE
  DISTINCT
   RECEIVE FROM ALL PARTITIONS
    SEND PARTITION RESULTS TO COORDINATOR
     NESTLOOP INDEX JOIN
     inline (INDEX SCAN of "T1" using "SYS_IDX_PK_T1_10026" (unique-scan covering))
      SEQUENTIAL SCAN of "T2"

with the proposed changes the plan may look like


 RETURN RESULTS TO STORED PROCEDURE
  DISTINCT
   RECEIVE FROM ALL PARTITIONS
    SEND PARTITION RESULTS TO COORDINATOR
     NESTLOOP INDEX JOIN
     inline (INDEX SCAN of "T1" using "SYS_IDX_PK_T1_10026" (unique-scan covering)
             inline (DISTINCT))
      SEQUENTIAL SCAN of "T2"


With the added complexity (an inline within an inline), the straightforward approach will make the NestLoopIndexExecutor logic almost unmanageable.

 

As for the generic solution, I'm not sure why the reversed order would always make sense, but some simple rule like that seems in order. I suspect that reverse order may work short-term as an accidental effect of the order in which various inline optimizations are considered. In the long term, this aspect of the optimizer may need to be reworked to force it to follow a convention of inlining nodes in a reliable order to drive execution. Any complex logic along these lines seems better placed in the planner/optimizer, leaving the executor to iterate over a simple sequence of inlined nodes and apply them in fixed order.

If inlining is the right approach, then I have a few questions I would like to clarify.

1. Where should the inline node go? The first scan node operating on the same table as the Distinct/HashAggregate, encountered while working down the plan graph, may be the right choice.

2. When should the optimization be disabled? For example, if the distinct expression involves more than one table (right now that's not possible, simply because only one column in the expression is supported). Are there any other limitations?

3. Is PlanAssembler::handleDistinct the right place to do the inlining, or is a new planners/microoptimizations/PushDownDistinctIntoScans.java the better choice, similar to the existing PushdownLimitsIntoScans.java?

4. The inline order. I totally agree with you that it should be established by the PlanAssembler/Optimizer and not by the executors, but I am not sure how.

Am I over-complicating things?

Mike

Michael Alexeev

Mar 26, 2012, 9:53:05 PM
to voltd...@googlegroups.com
Hi All,

I modified the planner to push down the DISTINCT node to each partition and inline it within the corresponding Scan node. I also made the sequential scan executor take advantage of the inline distinct node while iterating over the target table, to minimize the size of the output table. It seems to be working (at least I am not getting an out-of-memory error with ~20M rows).

But I am beginning to question whether inlining is the right way to go. If the answer is yes, then (Hash)Aggregate also needs to be inlined, simply because the out-of-memory condition happens within the very first executor - the Sequential Scan. With this approach inlining becomes almost the rule rather than the exception, and brings a host of tough questions like execution order, node compatibility, etc.

Maybe a bigger memory threshold is the simple answer, or is there some other way around it?

Mike  

Ryan Betts

Mar 26, 2012, 10:13:39 PM
to voltd...@googlegroups.com
>
> Maybe a bigger memory threshold is the simple answer or is there any other
> way around?
>

Hey there. Sorry this issue has gone down a bit of a rabbit hole. Can
you send some of your code for this change? I think I need to read
some code carefully to form an opinion on the next best steps.

Ryan.

Ariel Weisberg

Mar 26, 2012, 10:22:51 PM
to voltd...@googlegroups.com
Hi,

I am confused, the sequential scan executor supports distinct? There are a lot of plan graph configs that don't actually work as you would expect. The executor has to support the inlined node. If you look at https://github.com/VoltDB/voltdb/blob/master/src/ee/executors/seqscanexecutor.cpp it doesn't check for that kind of inlined node.

You would have to create a new plan node above the seqscan that is a distinct executor. If the seqscan node does the optimization where it presents itself as the output table it should work as expected.

Somewhat orthogonal to what you are working on, but inlining is a crappy hack that should go away. It exists because executors have ginormous input/output tables instead of being chained together and outputting a row at a time via iterators. By inlining functionality (and duplicating the code, blech!) we reduce the number of input/output tables and the number of times a tuple has to be copied, as well as the amount of temp table space used during a query. That wouldn't be necessary if each executor only materialized a tuple at a time.
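
(To make that concrete, a toy sketch of row-at-a-time chaining -- invented interfaces, nothing like the current EE code: each executor hands out one row per next() call, so nothing ever materializes its full output.)

// Sketch only: iterator-style (pull) executors chained together.
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

using Tuple = std::vector<int>;

struct Executor {
    virtual std::optional<Tuple> next() = 0;         // one row at a time
    virtual ~Executor() = default;
};

struct Scan : Executor {                             // leaf: reads the table
    const std::vector<Tuple>& table;
    std::size_t pos = 0;
    explicit Scan(const std::vector<Tuple>& t) : table(t) {}
    std::optional<Tuple> next() override {
        if (pos >= table.size()) return std::nullopt;
        return table[pos++];
    }
};

struct Distinct : Executor {                         // pulls from its child
    Executor& child;
    std::set<Tuple> seen;
    explicit Distinct(Executor& c) : child(c) {}
    std::optional<Tuple> next() override {
        while (auto t = child.next())
            if (seen.insert(*t).second) return t;    // forward first sightings
        return std::nullopt;
    }
};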

Thanks,
Ariel

Ariel Weisberg

Mar 26, 2012, 10:24:09 PM
to voltd...@googlegroups.com
Hi,

OK, looks like I need to read further back in the discussion.

Sorry,
Ariel

Michael Alexeev

Mar 26, 2012, 10:42:44 PM
to voltd...@googlegroups.com
Hi Ryan, here are the patches.

Mike
0082-ENG-2503.patch
0083-ENG-2503.patch
0084-ENG-2503.patch
0085-ENG-2503.patch

Paul Martel

Mar 26, 2012, 11:02:37 PM
to voltd...@googlegroups.com
Mike,
Congratulations on your success so far.

I agree with you that while inlining offers some general benefits, its widespread use has lots of implications and complications.
It amounts to a shift in the basic architecture of the executor from a (primarily) bottom-up push-based system to a (primarily) top-down pull-based system.

Actually, executor trees for pure push-based systems tend to look structurally identical to executor trees for pull-based (iterator-based) systems -- they're just processed differently.  Much of the structural complication arises from compromises that blend elements of the two systems. Inlining in a push-based system and pre-batching results in a pull-based system are classic examples. This is where the duplication of effort comes in that Ariel mentioned, as each feature gets packaged in two different styles, one of which appears as a "wart" to the prevailing architecture.

In my opinion, the real architectural issue is too tight a coupling of "specific node functionality" with generic node flow control. In other words, by refactoring the executor node code to separate the various forms of tuple processing from the push-based (or pull-based) boilerplate code that manages where the tuples are coming from and where they are going, we should be able to develop components that work equally well in a push-based, pull-based, or hybrid system without a lot of duplication, and possibly with some reduction/reuse of that boiler-plate.
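
(One way to picture that separation -- purely illustrative names: the per-tuple logic becomes a small, flow-agnostic component that either a push-style or a pull-style driver can call.)

// Sketch only: per-tuple work (here, a projection) separated from the
// push/pull flow-control boilerplate, so one component serves both styles.
#include <cstddef>
#include <vector>

using Tuple = std::vector<int>;

// Flow-agnostic per-tuple logic: project a subset of columns.
struct ProjectionProcessor {
    std::vector<std::size_t> columns;                 // output column indexes
    Tuple process(const Tuple& in) const {
        Tuple out;
        for (std::size_t c : columns) out.push_back(in[c]);
        return out;
    }
};

// Push-style driver: the source loop pushes each row through the processor.
void pushDriver(const std::vector<Tuple>& source, const ProjectionProcessor& p,
                std::vector<Tuple>& sink) {
    for (const Tuple& t : source) sink.push_back(p.process(t));
}

// Pull-style driver: the consumer asks for one processed row at a time.
struct PullDriver {
    const std::vector<Tuple>& source;
    const ProjectionProcessor& proc;
    std::size_t pos = 0;
    bool next(Tuple& out) {
        if (pos >= source.size()) return false;
        out = proc.process(source[pos++]);
        return true;
    }
};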

I look forward to reviewing your code patches.
--paul

Michael Alexeev

Mar 26, 2012, 11:20:20 PM
to voltd...@googlegroups.com
Hi Ariel,

On Mon, Mar 26, 2012 at 10:22 PM, Ariel Weisberg <awei...@voltdb.com> wrote:
Hi,

I am confused, the sequential scan executor supports distinct? There are a lot of plan graph configs that don't actually work as you would expect. The executor has to support the inlined node. If you look at https://github.com/VoltDB/voltdb/blob/master/src/ee/executors/seqscanexecutor.cpp it doesn't check for that kind of inlined node.

Yes, I know. I modified SeqScanExecutor::p_execute to look for an inlined Distinct node.

You would have to create a new plan node above the seqscan that is a distinct executor. If the seqscan node does the optimization where it presents itself as the output table it should work as expected.

But adding a distinct node above the seqscan for each partition won't solve ENG-2503. It will reduce the number of tuples sent to the coordinator, but this is not enough.
The plan for "Select distinct log_name from log;" would be


RETURN RESULTS TO STORED PROCEDURE
 DISTINCT
  RECEIVE FROM ALL PARTITIONS
   SEND PARTITION RESULTS TO COORDINATOR
    DISTINCT
      SEQUENTIAL SCAN of "LOG"

By the time the distributed distinct executor gets a chance to filter out tuples it's too late, because the seqscan copies every single tuple to its output table. The problem is that the seqscan optimization to bypass the copy is off because of the presence of the inlined projection. Is this understanding more or less correct?
 

Somewhat orthogonal to what you are working on, but Inlining is a crappy hack that should go away.

Actually I feel exactly the same way about inlined nodes. The executor code is already complex, and adding more inlines won't make it any simpler.
 
It exists because executors have ginormous input/output tables instead of being chained together and outputting a row at a time via iterators.

This is a very nice suggestion! Instead of multiple iterations over the tables, have a single loop and pass the current tuple from one executor to another. I will definitely look into it.
 
By inlining functionality (and duplicating the code, blech!) we reduce the number of input/output tables and the number of times a tuple has to be copied as well as the amount of temp table space used during a query. That wouldn't be necessary if each executor only materialized a tuple at a time.

Thanks,
Ariel

Thanks,
Mike

Ariel Weisberg

Mar 27, 2012, 9:02:26 AM
to voltd...@googlegroups.com
Hi,

The way to get the optimization back is to put the projection past the distinct, aka write a real plan and let the EE do the work. I am not happy about bending the planner around the design of the EE.

However, that requires the EE to execute plans in a sane way, which is a bigger task. Inlined projection is a side effect of the bad executor design that discourages chaining of executors.

Thanks,
Ariel

Michael Alexeev

Mar 27, 2012, 9:04:53 AM
to voltd...@googlegroups.com
Hi Paul,
 
Thank you for such a nice and precise summary!

On Mon, Mar 26, 2012 at 11:02 PM, Paul Martel <pma...@voltdb.com> wrote:
Mike,
Congratulations on your success so far.

I agree with you that while inlining offers some general benefits, its widespread use has lots of implications and complications.
It amounts to a shift in the basic architecture of the executor from a (primarily) bottom-up push-based system to a (primarily) top-down pull-based system.

Actually, executor trees for pure push-based systems tend to look structurally identical to executor trees for pull-based (iterator-based) systems -- they're just processed differently.  Much of the structural complication arises from compromises that blend elements of the two systems. Inlining in a push-based system and pre-batching results in a pull-based system are classic examples. This is where the duplication of effort comes in that Ariel mentioned, as each feature gets packaged in two different styles, one of which appears as a "wart" to the prevailing architecture.

In my opinion, the real architectural issue is too tight a coupling of "specific node functionality" with generic node flow control. In other words, by refactoring the executor node code to separate the various forms of tuple processing from the push-based (or pull-based) boilerplate code that manages where the tuples are coming from and where they are going, we should be able to develop components that work equally well in a push-based, pull-based, or hybrid system without a lot of duplication, and possibly with some reduction/reuse of that boiler-plate.
 
One more issue is the amount of extra copying performed during the hand-off from one executor to another, especially in the case of any type of aggregation. The whole table gets copied at the first step just to be reduced to a few rows at the very next one.
 
Mike

Paul Martel

Mar 30, 2012, 3:57:04 PM
to voltd...@googlegroups.com
Mike,

Thanks for the distinct optimization patches.
Here's my feedback:

In PlanAssembler.java:

The search below the distinct node for the scan where we wish to inline it appears to be a little too aggressive.
My concern is that pushing a distinct down through some nodes can lead to incorrect results.
Particular cases where this can cause problems are some joins and aggregates, though there may be others (?).
It should probably be kept simple, pushing down only through "known safe" cases, otherwise, erring on the side of less performant but correct.
Initially, we could restrict ourselves to applying the push-down to an immediate child scan.
Other cases can be added as we may discover them as "missed easy cases" in testing.

In DistinctPlanNode.java:

The DistinctPlanNode::resolveIndexes code does not actually have to deal with the child-less (inlined) node case, since the function is only called on non-inlined nodes -- those that are participating in the plan tree.

Instead, for safety, this code could be refactored in the style of ProjectionPlanNode, with the normal (non-inline) entry point making the usual assertions and delegating to a resolveColumnIndexesUsingSchema, which is an alternative entry (that should be) called by SeqScanPlanNode for inlines (again, in the style used there for inline ProjectionPlanNodes).

In distinctplannode.cpp (vs distinctexecutor.cpp):

The isTupleValueDistinct function added to distinctplannode would be better kept in distinctexecutor,
factored there (rather than duplicated) into a function that could be called both from SeqScanExecutor::p_execute loop in the new inlined case and from within the DistinctExecutor::p_execute loop for the original normal case.

I appreciate how your change follows the precedent set by the inline projection processing -- involving the plan node in the execution process rather than simply using the plan node to initialize a corresponding "inline executor node".
But I think that this approach to inlining projections may have been a mis-step and that it may be time to set a new precedent (and possibly plan to back-patch projection to follow it).

...and seqscanexecutor.cpp:

To "do the right thing" in this respect, we could construct a DistinctExecutor (possibly initializing a SeqScanExecutor pointer member initialized in SeqScanExecutor::p_init?) to do the SeqScanExecutor's dirty work.
The isTupleValueDistinct function could then be made non-static and its signature simplified so that it could operate on an INTERNAL std::set data member (migrated to be a member rather than a DistinctExecutor::p_execute local variable).

I expect that this change idiom -- migrating state from executor::p_execute locals into executor data members (sometimes initialized in p_init) -- will become a recurring theme as we migrate away from "push".

BTW, I don't see the rationale for adding the TableTuple* vs. using the existing TableTuple& in SeqScanExecutor::p_execute, so the change to this function could be very small.

Finally, the changeset introduced some whitespace issues (use of spaces on empty lines or at ends of lines)
that cause it to fail our "hard rule coding standard" check-in tests. You can run "ant licensecheck" to quickly see the gory details -- it is run as an early step in "ant check", our standard check-in tests.
Unfortunately, I haven't taken the time to write the sed script that would automatically fix such issues.

Let me know if you have any questions about this.

Also, I'm still digesting your pull execution pseudo code (even while coming up to speed on the existing executor code, and possibly capturing what I'm learning for the developer wiki) -- I expect I'll get back to you in a little while.

Regards,
--paul

Michael Alexeev

Apr 1, 2012, 9:02:33 PM
to voltd...@googlegroups.com
Hi Paul,

Thank you for the great feedback!

In PlanAssembler.java:

The search below the distinct node for the scan where we wish to inline it appears to be a little too aggressive.
My concern is that pushing a distinct down through some nodes can lead to incorrect results.
Particular cases where this can cause problems are some joins and aggregates, though there may be others (?).

I had 'tunnel vision' that the only way to get around the out-of-memory problem was the distinct route. There are definitely better ways to solve the problem. Yes, pushing the distinct node all the way down is way too aggressive.
 
It should probably be kept simple, pushing down only through "known safe" cases, otherwise, erring on the side of less performant but correct.
Initially, we could restrict ourselves to applying the push-down to an immediate child scan.
Other cases can be added as we may discover them as "missed easy cases" in testing.

I will change it to simply push down over the receive/send node pair for each partition.

In DistinctPlanNode.java:

The DistinctPlanNode::
resolveIndexes code does not actually have to deal with the child-less (inlined) node case
since the function is only called on non-inlined nodes -- those that are participating in the plan tree.

Instead, for safety, this code could be refactored in the style of ProjectionPlanNode, with the normal (non-inline) entry point making the usual assertions and delegating to a resolveColumnIndexesUsingSchema which is
 an alternative entry (that should be) called by SeqScanPlanNode for inlines (again, in the style used there for inline ProjectionPlanNodes).

Since we are not going to make the Distinct node inline, this change won't be necessary. I will revert it to the original version.


In distinctplannode.cpp (vs distinctexecutor.cpp):

The isTupleValueDistinct function added to distinctplannode would be better kept in distinctexecutor,
factored there (rather than duplicated) into a function that could be called both from SeqScanExecutor::p_execute loop in the new inlined case and from within the DistinctExecutor::p_execute loop for the original normal case.

I appreciate how your change follows the precedent set by the inline projection processing -- involving the plan node in the execution process rather than simply using the plan node to initialize a corresponding "inline executor node".
But I think that this approach to inlining projections may have been a mis-step and that it may be time to set a new precedent (and possibly plan to back-patch projection to follow it).

It took me a while to find the 'right' place to add the distinct code there, and the seqscan executor is the simplest of the existing scan executors. One more con against the inline solution.
 
...and seqscanexecutor.cpp:

To "do the right thing" in this respect, we could construct a DistinctExecutor (possibly initializing a SeqScanExecutor pointer member initialized in SeqScanExecutor::p_init?) to do the SeqScanExecutor's dirty work.
The isTupleValueDistinct function could then be made non-static and its signature simplified so that it could operate on an INTERNAL std::set data member (migrated to be a member rather than a DistinctExecutor::p_execute local variable).
I expect that this change idiom -- migrating of state from executor::p_execute locals into executor data members (sometimes initialized in p_init) will become a recurring theme as we migrate away from "push".
 
If we move toward the pull model without intermediate tables, I don't think we would need a DistinctExecutor within the SeqScan one in the first place.

Just curious: since executors are stateful, does it mean one transaction at a time per server thread? How does it scale if the number of clients goes up? Where does the synchronization between multiple threads (transactions) happen?
 

BTW, I don't see the rationale for adding the TableTuple* vs. using the existing TableTuple& in SeqScanExecutor::p_execute, so the change to this function could be very small.

Depending on whether you have a projection node or not, you insert either the tuple (from the iterator) or tempTuple (the tuple after the projection is applied) into the output table. I wanted to eliminate the duplicate IF statement there.

Finally, the changeset introduced some whitespace issues (use of spaces on empty lines or at ends of lines)
that cause it to fail our "hard rule coding standard" check-in tests. You can run "ant licensecheck" to quickly see the gory details -- which is run as an early step in "ant check" our standard check-in tests .
Unfortunately, I haven't taken the time to write the sed script that would automatically fix such issues.

Sorry about that. I knew this wouldn't be the final patch and hadn't run it. Anyway, I should have done it. Out of curiosity, what is the history behind this rule?


Let me know if you have any questions about this.

Also, I'm still digesting your pull execution pseudo code (even while coming up to speed on the existing executor code, and possibly capturing what I'm learning for the developer wiki) -- I expect I'll get back to you in a little while.

Looking forward to it. I have to admit I haven't yet looked at whether the pull approach would work for all executors, only the ones that participate in SELECTs. I will investigate the remaining ones as well.


Mike