Map-Reduce: Please consider adding the option to prevent infinite loops dynamically based on my working example attached


Shlomo Ildar Kazakov

Dec 13, 2020, 5:14:55 PM12/13/20
to RavenDB - 2nd generation document database
The static checks on cyclic dependencies between indexes aim to prevent infinite loops at runtime. But the restriction on static cyclic dependencies between indexes is too strict, as it requires every new state to be recalculated from scratch, not allowing the previous state value to be reused.

Meanwhile, this static restriction could be accompanied by an option for dynamic infinite-loop prevention. This would let us greatly extend the powerful Map-Reduce feature with the ability to implement really complex logic by setting dependencies on previously calculated and saved states, in elegant code that is easy to develop, support, and improve. I think it would be great to give developers the ability to combine the two approaches.

At present, the restriction on static cyclic dependencies between indexes can be worked around with the Subscriptions/Changes API, but at the cost of extra code (outside RavenDB), significant performance losses, and additional traffic and resources. Another way is switching off the restriction in a local build, as I have done for this example.

Adding this feature is rather easy and would greatly extend RavenDB's capabilities and the development experience. Please look at the attached 'Shares1' experimental database dump. There is an example task implemented in two ways:
  • Without static cyclic dependencies between indexes. 
  • With static cyclic dependencies between indexes (I have turned off the static cyclic dependency checks in my local build of Raven.Server). At runtime there are actually no infinite loops; the indexes work nicely. The actual dependencies are between the documents produced by the indexes, and those dependencies are not cyclic but spiral-shaped. This approach has some advantages, and there are ways to prevent infinite loops dynamically. I will describe these below.
The example task is rather simple: periodic calculation of shareholders' shares. We have opening investment amounts ('Opening Balances' collection) on participant accounts and their corresponding shares (an account's investment amount / the sum of investment amounts on all accounts ('totalAmount')). Then we have investment transactions made by accounts ('Transactions' collection). For each period (month) and each account we should prepare a report with opening and closing amounts and shares (share = amount / totalAmount). We should also prepare periodic aggregated reports across all accounts (in which the share equals 1.0).
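To make the arithmetic concrete, here is a minimal sketch in plain JavaScript (hypothetical document shapes, not RavenDB index code) of the per-period report calculation described above: closings accumulate transaction amounts on top of openings, and each share is the account's amount divided by the period's totalAmount.

```javascript
// Sketch only: openings is { account: amount }, transactions is
// [{ account, amount }]. Returns one report per account for the period.
function buildPeriodReports(openings, transactions) {
  const closings = { ...openings };
  for (const t of transactions) {
    closings[t.account] = (closings[t.account] || 0) + t.amount;
  }
  const totalAmount = Object.values(closings).reduce((s, a) => s + a, 0);
  return Object.entries(closings).map(([account, amount]) => ({
    account,
    opening: openings[account] || 0,
    closing: amount,
    share: amount / totalAmount, // shares sum to 1.0 across all accounts
  }));
}
```

For example, with openings {a: 100, b: 100} and a single transaction of 100 on account a, account a closes at 200 with a share of 2/3, and the shares across all accounts sum to 1.0.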

The first implementation is based on the running-totals approach demonstrated by Oren in the webinar on the Amberwood case, so all the related indexes in the example are marked with the prefix 'Oren/'. It recalculates closings from scratch for each (period, account). It required me to implement six indexes (each depending on one or more of the previous ones); the first four are intermediate auxiliary indexes needed to avoid static cyclic dependencies (perhaps fewer would suffice; it would be interesting to see):
  1. Oren/AmountsBy/Account/Period: Aggregates amounts by accounts and periods
  2. Oren/ClosingsBy/Account/Period: Calculates running totals by accounts and periods (closing amounts) 
  3. Oren/PeriodCount: Counts periods per period kind (used by the next index)
  4. Oren/ClosingsBy/Period: Aggregates closing amounts by periods (closing total amounts)
  5. Oren/ReportsBy/Account/Period: Calculates the final reports with shares, amountChange and openings
  6. Oren/ReportsBy/Period: Aggregates final account reports by periods

The second implementation is my own and reflects how the original task was implemented in Excel. It calculates the documents for each period based on the documents calculated for the previous period. It required only two indexes that depend on each other, one of them even on itself:
  • ReportsBy/Period: Aggregates reports by periods (totals, except for totalAmounts (!), similar to Oren/ClosingsBy/Period) from the next index.
  • ReportsBy/Account/Period: Calculates the final report documents. It depends on its own output collection: for every changed report document it produces the document for the next period (until the start date becomes invalid, i.e. falls outside the latest border of the overall period, which stops the index calculation loop), setting the new opening equal to the closing of the source report. It also uses total amounts from the previous index, but this does not cause an infinite loop: amount (here) -> total amount (prev index) -> share (here) -> total share (prev index).
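The "spiral" shape of the dependency can be sketched as follows, again in plain JavaScript with hypothetical names rather than RavenDB index syntax: each report document yields the next period's report with its opening set to the previous closing, and the chain stops once the period start passes the overall end border, which is exactly what keeps the loop finite.

```javascript
// Sketch only: periods are modeled as plain integers for brevity.
function nextReport(report, lastPeriod) {
  const nextStart = report.periodStart + 1;
  if (nextStart > lastPeriod) return null; // start date out of the overall border: stop the loop
  return {
    account: report.account,
    periodStart: nextStart,
    opening: report.closing, // the new opening equals the previous closing
    closing: report.closing, // unchanged until this period's transactions are applied
  };
}

// Unfold the full chain of reports produced by the recursive triggering.
function unfoldReports(first, lastPeriod) {
  const chain = [first];
  let r = first;
  while ((r = nextReport(r, lastPeriod)) !== null) chain.push(r);
  return chain;
}
```

Starting from a period-1 report with a closing of 50 and a last period of 3, the chain contains exactly three reports, each subsequent opening equal to 50; the termination check plays the role that the proposed dynamic limit would play for truly cyclic cases.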

The dynamic prevention of infinite loops

RavenDB can already detect static cyclic dependencies, so it could offer the option to prevent infinite loops dynamically for every detected cyclic dependency by configuring and checking the maximum number of index calculations in the loop.

Both static dependencies between indexes described above have a predictable number of iterations, which can be calculated for the modified document (using its fields) after the first iteration (when RavenDB sees that the next index (possibly the same one) in the static cyclic dependency path is to be updated):
  • For the first dependency: For every period.kind of the output document:
from Periods 
group by kind
select count(dateStart) As maxIndexCalculationsInTheLoop
  • For the second dependency: it is simply a static limit of 4 index calculations in the loop.
If no such exact criterion is available, static limits with default values could be configured and used for the dynamic checks.
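A sketch of how such a limit could be derived from the data itself, mirroring the 'from Periods group by kind' query above (plain JavaScript stand-in; the document shapes are assumptions):

```javascript
// Count periods per kind; the count is the proposed maximum number of
// index calculations in the loop for documents of that kind.
function maxIndexCalculationsByKind(periods) {
  // periods: [{ kind, dateStart }] -> { kind: count }
  const limits = {};
  for (const p of periods) {
    limits[p.kind] = (limits[p.kind] || 0) + 1;
  }
  return limits;
}
```

With twelve monthly Periods documents, for instance, the loop over the month-kind output documents would be allowed at most twelve recursive triggerings.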


Let's compare the two implementations
  • Code
    • The first way uses Map-Reduce over both the Accounts and Periods dimensions in the Oren/ClosingsBy/Account/Period index (i.e. their Cartesian product; the Periods dimension is needed to recalculate closings from scratch). So here we have a pure Map-Reduce approach.
    • The second way uses Map-Reduce over the Accounts dimension only (as it relies on the saved state for the previous period). So here we have a combination of the Map-Reduce and dependency-on-the-saved-state approaches. 
    • The second way has only two final indexes instead of the first way's six. The first way requires four extra auxiliary indexes to avoid static cyclic dependencies. 
    • The second way is straightforward, can be reproduced directly from the Excel prototype, requires much less code, and is much easier and faster to develop, support, and improve.
    • For example, I refactored the second implementation to improve document structuring, but I haven't done so for the first one, as it is much more time-consuming.
    • This example is rather simple. More complex business logic would make the second way even more preferable.
  • Productivity 
    • Time/CPU
      • Modifying a transaction in the latest period
        • The second way is about 3 times faster (comparing the summary length of the colored blocks) and, I guess, requires less CPU, as it calculates fewer indexes and far fewer Map-Reduce entries (no extra Periods dimension).
PerfLastPeriod.png
      • Adding a new latest period with new transactions
        • The second way is about 2 times faster (comparing the summary length of the colored blocks) and, I guess, requires less CPU, as it calculates fewer indexes and far fewer Map-Reduce entries (no extra Periods dimension).
Perf - AddingLatestPeriod.png
        • Expanded view:
Perf - AddingLatestPeriodD1.png
Perf - AddingLatestPeriodD2.png
        • Details are in the attached 'indexPerf - Adding new latest period with transactions.json' file

      • Modifying a transaction in the earliest period
        • The first way is about 1.5 times faster (comparing the summary length of the colored blocks) because, I guess, it runs in parallel across periods, while the second way is sequential and requires index saves for every period.
PerfFirstPeriod.png
        • Expanded view:
PerfFirstPeriodD1.png
PerfFirstPeriodD2.png
        • Details are in the attached 'indexPerf - Updating transaction in the first period.json' file
    • Memory / Disk space
      • I guess the first way should take much more memory and disk space for the auxiliary indexes and, especially, to keep the Map-Reduce tree for the Oren/ClosingsBy/Account/Period index, as it requires the additional Periods dimension.
Thus, the second way, combining Map-Reduce with dependencies on earlier calculated and saved states, looks preferable to the first way from the following points of view:
  • development simplicity and speed
  • memory and disk space consumed
  • runtime performance in the most common scenarios of changes in the latest period (changes in the past periods are rare in many cases)
So I think it would be great to enrich the powerful Map-Reduce feature with an option to prevent infinite loops dynamically, in addition to the current static checks. This would allow developers to combine the Map-Reduce and dependency-on-the-saved-state approaches, greatly extending RavenDB's capabilities and improving the development experience for really complex business logic.

As a first step, it would cost nearly nothing to simply allow switching off the static checks for selected indexes as an experimental feature. Afterwards, the dynamic checks could be introduced (as described above). Optionally, I suppose, some optimisations could be made later to increase performance in looped index calculations (e.g. fewer disk writes for indexes' output collections while they are being modified in the loop).

What do you think about this option?

Thanks,
Shlomo
Dump of Shares1 2020-12-13 22-28.ravendbdump
indexPerf - Shares1.zip

Oren Eini (Ayende Rahien)

Dec 16, 2020, 6:42:54 AM12/16/20
to ravendb
Dynamic recursion checks are something that I'm really not eager to do. They are complex.
You have to keep track of the data flow in some manner through map/reduce operations, which means that you can't reduce the data.

How did you create the dump file? I tried importing that, but it fails because of the reference loop.
I assume that you are using a custom version of RavenDB here?

--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ravendb/0123b372-ee8a-4d88-a1dc-5960821769e9n%40googlegroups.com.


--
Oren Eini
CEO   /   Hibernating Rhinos LTD
Skype:  ayenderahien
Support:  sup...@ravendb.net
  

Shlomo Ildar Kazakov

Dec 16, 2020, 11:28:43 AM12/16/20
to rav...@googlegroups.com
Hi Oren,

> How did you create the dump file? I tried importing that, but it fails because of the reference loop.
> I assume that you are using a custom version of RavenDB here?

Exactly :)
> As for now the restriction on static cyclic dependencies between indexes can be overcome with Subscriptions/Changes API, however, at the cost of extra code (out of RavenDB), significant performance losses, and additional traffic and resources. Another way is switching off the restriction in the local build as I have made for this example.
> With static cyclic dependencies between indexes (I have turned off the static cyclic dependences checks in my local build of Raven.Server). 

I had built it to try running the client in the browser with the changes you suggested, so it was not difficult to then just switch off the static restrictions in a single file; you can find its modified version attached. With it, static cyclic recursive index triggering works just fine and RavenDB shines! :)
And here are the modified binaries of Raven.Server (5.1.2).

> Dynamic recursion checks are something that I'm really not eager to do. They are complex
> You have to keep track of the data flow in some manner through map/reduce operations, which means that you can't reduce the data.

That is the case if we try to truly identify an infinite loop dynamically. But that is not necessary, and I completely agree with you that it should not be done. 

But I have proposed another approach that is rather easy to implement without affecting performance or map-reduce functionality at all. It would greatly extend RavenDB's internal computational capabilities, which run very efficiently and are easy to develop against. Actually, RavenDB already has all the required building blocks (such as saving map-reduce results as documents and recursive index calculation triggering), but at the moment they are tightly restricted by the static checks. All that is needed to unlock this existing power is to let a developer make those restrictions more flexible.

We can simply prevent infinite recursive triggering along statically identified cyclic paths (<Index1 -> Index2 -> ... IndexN, Index1>) by limiting the maximum number of recursive index calculation triggerings on them, i.e. the length of the dynamic path of recursively triggered indexes along these static paths. In most cases this limit is rather easy to predict (e.g. in the considered example). Recursive index calculation triggering on a static cyclic path is easy to identify. We don't even have to track the dynamic path itself; we just count its length dynamically (starting from a given starting index on the static cyclic path) and stop with an error once the allowed maximum is exceeded (all indexes on the static cyclic path should then be put into an error state).
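The counting described above could look roughly like the following guard (a sketch under the assumption that each statically detected cycle gets an id and a configured limit; all names here are hypothetical, for illustration only):

```javascript
// Per-cycle triggering counter: increment on every recursive index
// triggering, error out once the configured maximum is exceeded.
class CycleGuard {
  constructor(limits) {        // limits: { cycleId: maxTriggerings }
    this.limits = limits;
    this.counts = {};
  }
  onIndexTriggered(cycleId) {
    const n = (this.counts[cycleId] = (this.counts[cycleId] || 0) + 1);
    if (n > this.limits[cycleId]) {
      // in the proposal: put all indexes on the cycle into an error state
      throw new Error(`Cycle ${cycleId} exceeded ${this.limits[cycleId]} triggerings`);
    }
  }
  onRecursionFinished(cycleId) {
    this.counts[cycleId] = 0;  // reset when the recursion drains
  }
}
```

On the third triggering against a limit of 2, the guard raises the error that, in the proposal, would turn every index on the cycle to an error state.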

Here is my proposal from the initial email (with minor updates): 

------------------------------------
The dynamic prevention of infinite loops

RavenDB can already detect static cyclic dependencies, so it could provide the option to prevent infinite loops dynamically for every statically detected cyclic dependency by configuring and checking the maximum number of index calculations in the loop.

Both static dependencies between indexes described above have a predictable number of iterations, which can be calculated for the modified document (using its fields) after the first iteration (when RavenDB sees that the next index (possibly the same one) in the static cyclic dependency path is to be updated):
  • For the first dependency: For every period.kind of the output document:
from Periods 
group by kind
select count(dateStart) As maxRecursiveIndexTriggerings
  • For the second dependency: it is simply a static limit of 4 index calculations in the loop.
If no such exact criterion is available, static limits with default values could be configured and used for the dynamic checks.

------------------------------------
So RavenDB could offer a developer, for every statically identified cyclic recursive index triggering path (assigned an id, staticCyclicRecursionPathId), a choice: either eliminate it (the default, as we have now) or configure a dynamic limit on the number of recursive index triggerings along the path (in the considered example, for the identified static path <ReportsBy/Account/Period, ReportsBy/Account/Period>):
  • To set the starting index in the static path, in the considered example ReportsBy/Account/Period.
  • To configure:
    • the optional DynamicRecursionLimitationIndex (named like DynamicRecursionLimitationIndex/<staticCyclicRecursionPathId>) for defining the maximum number of recursive index calculation triggerings on the static path, in the considered example:
      from Periods 
      group by kind
      select count(dateStart) As maxRecursiveIndexTriggerings

    • the optional DynamicRecursionLimitationQuery with arguments, in the considered example ($p0 is its argument): 
      from DynamicRecursionLimitationIndex/<staticCyclicRecursionPathId>
      where kind = $p0
      select maxRecursiveIndexTriggerings

    • the final dynamicRecursionLimitation function to get for every starting index's output document d the required maximum number from the DynamicRecursionLimitationIndex, in the considered example:
      function dynamicRecursionLimitation(d) {
          return dynamicRecursionLimitationQuery(d.period.kind)
      }

For another static cyclic index recursive triggering path <ReportsBy/Account/Period, ReportsBy/Period, ReportsBy/Account/Period> in the considered example:
    • the starting index is ReportsBy/Account/Period
    • DynamicRecursionLimitationIndex and DynamicRecursionLimitationQuery are not required (empty)
    • dynamicRecursionLimitation
       function dynamicRecursionLimitation(d) {
          return 4
      }
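Wiring the three pieces together might look like this (plain JavaScript illustration; the limitation index results are stubbed with made-up data, since everything here is a proposal rather than an existing RavenDB API):

```javascript
// Stub of the DynamicRecursionLimitationIndex output (hypothetical data).
const limitationIndexResults = [
  { kind: "month", maxRecursiveIndexTriggerings: 12 },
];

// Stand-in for the DynamicRecursionLimitationQuery ($p0 is the kind argument).
function dynamicRecursionLimitationQuery(kind) {
  const row = limitationIndexResults.find(r => r.kind === kind);
  return row ? row.maxRecursiveIndexTriggerings : 0;
}

// The final per-document function from the proposal.
function dynamicRecursionLimitation(d) {
  return dynamicRecursionLimitationQuery(d.period.kind);
}
```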

That's all that is needed to unlock the great power of RavenDB's recursive indexing! :)



MapReduceIndex.cs

Oren Eini (Ayende Rahien)

Dec 20, 2020, 9:40:43 AM12/20/20
to ravendb
inline

 

We can simply prevent infinite recursive triggering along statically identified cyclic paths (<Index1 -> Index2 -> ... IndexN, Index1>) by limiting the maximum number of recursive index calculation triggerings on them, i.e. the length of the dynamic path of recursively triggered indexes along these static paths. In most cases this limit is rather easy to predict (e.g. in the considered example). Recursive index calculation triggering on a static cyclic path is easy to identify. We don't even have to track the dynamic path itself; we just count its length dynamically (starting from a given starting index on the static cyclic path) and stop with an error once the allowed maximum is exceeded (all indexes on the static cyclic path should then be put into an error state).


That would require us to keep track of each individual map/reduce value, unless I miss something?

Consider the case where we have a multi-map index, like so:


from o in docs.Orders
select new {
   RunningSum = o.Amount,
   PeriodTotal = o.Amount,
   Month = o.At.ToString("yyyy-MM-01")
}

from p in docs.NextComputed // <-- output of this index
select new
{
   p.RunningSum,
   PeriodTotal = 0,
   p.Month
}

// reduce
from r in results
group r by r.Month into g
from i in Enumerable.Range(0, 2)
select new {
   RunningSum = g.Sum(x => x.RunningSum),
   PeriodTotal = g.Sum(x => x.PeriodTotal),
   Month = DateTime.Parse(g.Key).AddMonths(i).ToString("yyyy-MM-01")
}

Gmail code, mind you, but it should be clear.

In this case, we output this to NextComputed, but also read from it. 
How would you expect to figure out if there is an infinite recursion or not? 
 
Here is my proposal from the initial email (with minor updates): 

------------------------------------
The dynamic prevention of infinite loops

RavenDB can already detect static cyclic dependencies so it could provide the option to prevent infinite loops dynamically for every statically detected cyclic dependency by configuring and checking the maximum number of calculations of indexes in the loop.

Both static dependencies between indexes described above have a predictable number of iterations which can be calculated for the modified document (using its fields) after the first iteration(when RavenDB sees that next index (maybe the same) in the static cyclic dependency path is to be updated):

That isn't actually true. Consider the case where we introduce an index after some indexes have already run, or where indexes mutate values internally. 
 
  • For the first dependency: For every period.kind of the output document:
from Periods 
group by kind
select count(dateStart) As maxRecursiveIndexTriggerings
  • For the second dependency: It is just static 4 index calculations in the loop.
In case this exact criteria is absent, static limitations with default values could be configured and used for dynamic checks.

Given that an index can perform arbitrary computation, I fear that this means that we have to solve the halting problem to figure it out.
 

Shlomo Ildar Kazakov

Dec 20, 2020, 6:16:34 PM12/20/20
to rav...@googlegroups.com
> That would require us to keep track of each individual map/reduce value, unless I miss something?

I got your point, you are right here. 

But I am looking for a way to make this great functionality available, so here is my updated and even simpler suggestion.
My approach is to let a developer analyze the logic he develops and make the right estimates, while still having the functionality available in case it is needed. And we do not need accuracy here, just a reasonable estimate that identifies and stops an infinite loop (even with some delay).

First, I assume that recursive triggerings are given higher priority than non-recursive ones, so that they finish before new recursions start. Thus we can identify the start of each new recursion and its starting documents (those modified in the output collection). No further tracking of map-reduce values (output documents) is needed within the considered index loop; the limit only has to be calculated at the recursion start.

Then I suggest a simplified infinite-loop criterion, which is just a simplification of the previous one: 
  • The maximum length of the recursive path is configured not for every static index loop, but only for each non-intersecting subset of them. It is calculated, when a new index recursion start is identified, for every modified output document, and the sum over all of them is taken. This covers recursions in all of them, including the case of multiple modified output documents.
  • There is no need to configure the starting index, as it is identified dynamically.
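The summing step can be stated in one line (a sketch; limitFor stands for whatever per-document limit function is configured for the loop bundle):

```javascript
// Sum the configured limit over every output document modified at the
// start of the recursion; the sum bounds the whole loop bundle.
function bundleRecursionLimit(modifiedDocs, limitFor) {
  return modifiedDocs.reduce((sum, d) => sum + limitFor(d), 0);
}
```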

In my Shares example, we have two intersecting static loops, so the maximum length of the recursive path is configured once for both of them (the same as before for the ReportsBy/Account/Period index, except for the 3* factor):
      • the optional DynamicRecursionLimitationIndex (named like DynamicRecursionLimitationIndex/<staticCyclicRecursionPathId>) for defining the maximum number of recursive index calculation triggerings on the static path, in the considered example:
        from Periods 
        group by kind
        select count(dateStart) As maxRecursiveIndexTriggerings

      • the optional DynamicRecursionLimitationQuery with arguments, in the considered example ($p0 is its argument): 
        from DynamicRecursionLimitationIndex/<staticCyclicRecursionPathId>
        where kind = $p0
        select maxRecursiveIndexTriggerings

      • the final dynamicRecursionLimitation function to get, for every output document d of the starting index, the required maximum from the DynamicRecursionLimitationIndex, in the considered example:
        function dynamicRecursionLimitation(d) {
            return 3 * dynamicRecursionLimitationQuery(d.period.kind)
        }

    Oren Eini (Ayende Rahien)

    Dec 21, 2020, 9:11:35 AM12/21/20
    to ravendb
    inline

    On Mon, Dec 21, 2020 at 1:16 AM Shlomo Ildar Kazakov <shlomo....@gmail.com> wrote:
    > That would require us to keep track of each individual map/reduce value, unless I miss something?

    I got your point, you are right here. 

    But I am looking for a way to make the great functionality available. So here is my modernized and even simplified suggestion.
    My approach is to let a developer analyze the logic he develops and make the right estimations, but to have the functionality in case there is a need for it. And we do not need accuracy here. We need just a reasonable estimation to identify and stop the infinite loop (even with some delay).

    First, I assume that recursive triggerings are of higher priority than non-recursive to finalize them before starting new recursions. Thus, we can identify the start of new recursions and their starting documents (modified in the output collection). No more tracking of map-reduce values (output documents) is needed in the considered index loop. Only at the recursion start to calculate the limit.

    That isn't actually true. We simply write the output documents and then deal with them normally. There isn't anything special about map/reduce like that, simply using two separate features together.
     

    Then, I suggest the simplified infinite loop criteria that are just a simplification of the previous one: 
    • The maximum length of the recursive path is to be configured not for every static index loop, but only for every non-intersecting subsets of them. It is calculated on the identification of the new index recursion start for every output document modified and then the sum on all of them is to be taken. It is to cover recursions in all of them, including the case of multiple modified output documents.
    • No need for configuring the starting index as it is identified dynamically.


    Consider the scenario where we add a new index after the work on another index was done. That is why this cannot happen.
     

    Shlomo Ildar Kazakov

    Dec 21, 2020, 1:03:29 PM12/21/20
    to rav...@googlegroups.com
    > That isn't actually true. We simply write the output documents and then deal with them normally. There isn't anything special about map/reduce like that, simply using two separate features together.

    I see that, as of now, recursive triggerings are not given higher priority. Making them higher priority is part of my suggestion, in order to support the functionality.
    I just suppose that it could be done without affecting existing functionality and performance.

    > Consider the scenario where we add a new index after the work on another index was done. That is why this cannot happen.

    In the case of recursive triggering having higher priority, the recursion will finish before the new index is added, so this is just a normal use case.
    If the new index creates a new index loop bundle (of intersecting indexes) or extends an existing one, then only the maximum recursive path length needs to be configured or modified for that loop bundle.
    What problems could arise here if recursive triggering has higher priority?

    Oren Eini (Ayende Rahien)

    Dec 22, 2020, 9:37:46 AM12/22/20
    to ravendb
    inline

    On Mon, Dec 21, 2020 at 8:03 PM Shlomo Ildar Kazakov <shlomo....@gmail.com> wrote:
    > That isn't actually true. We simply write the output documents and then deal with them normally. There isn't anything special about map/reduce like that, simply using two separate features together.

    I see that as for now recursive triggerings are not of higher priority. It is part of my suggestion in order to support the functionality.
    I just suppose that it could be done without affecting existing functionality and performance.

    That would involve a complete re-write of how we handle map/reduce and introduce major complexity to the system. That isn't likely to happen.
     

    > Consider the scenario where we add a new index after the work on another index was done. That is why this cannot happen.

    In the case of recursive triggering being of higher priority, the recursion will finish before adding a new index. So this is just a normal use case.
    If the new index creates a new index loop bundle (of intersecting indexes) or extends an existing one, then just the configuration of the maximum length of the recursive path is to be configured or modified for that index loop bundle.
    What problems could be here in the case of recursive triggering being of higher priority?

    You have to now go back and do the recursion work on all the other indexes as well. 

    In short, this feature suggestion introduces a lot of complexity, and the need can be handled either by using ETL to copy the documents to a new collection (so it evades the static analysis) or via a subscription.
     

    Shlomo Ildar Kazakov

    Dec 22, 2020, 12:35:28 PM12/22/20
    to rav...@googlegroups.com
    I see, ok!

    Personally, I don't need this dynamic infinite-loop prevention. All I need is a way to evade the static analysis without coding and performance overhead (compared to my custom Raven.Server build with the static restrictions on index loops switched off).

    Subscription does not meet this requirement.
    Is it possible to achieve this with ETL?


    Oren Eini (Ayende Rahien)

    Dec 24, 2020, 6:44:50 AM12/24/20
    to ravendb
    I'm fine with adding an explicit configuration parameter that would disable that. A PR would be welcome.

    Shlomo Ildar Kazakov

    Dec 25, 2020, 7:20:32 AM12/25/20
    to rav...@googlegroups.com
    That is just great, thanks a lot!

