My problem: I have 10s of millions of rows of time series data that I need to run moving window calculations on (though the groupby separates it into thousands of groups of about 10000 rows). In some cases EMA-like replacements work, but in other cases I need to do things like run multiple variable regressions and just need to use windows.
I've read the previous post on moving windows in scalding:
https://groups.google.com/forum/#!msg/cascading-user/CTYOjlHs6xE/A5TNIMFM_ikJ and found it somewhat confusing. I understand that I can use scanleft to walk through the data and can use an object to accumulate the window. However, it seems like scanleft will save a copy of this object for each value so if my moving window needs, for example, 100 previous values, the naive implementation of an immutable object wrapping a queue would require 100x the space of the original dataset.
What I think that I want is something like a reference to a mutable moving window object parameterized by the data type, so that I'm not storing a copy of it at each stage and its generic enough to reuse for different types of data. So for example:
.scanLeft( ('values, 'otherValues, 'moreValues) -> ('calculationResult, 'movingWindowObjectReference) ) ((Double.NaN, movingWindowObjectConstructor[(Double, Double, Double]))
{ (previousCalcs : (Double, movingWindowObjectReferenceType[Double, Double, Double]), currentValues : (Double, Double, Double) ) =>
{
val movingWindowObjectReference = previousCalcs._2
movingWindowObjectReference.enqueueValues(currentValues)
val calculationResult = Math.doSomeCalculations(movingWindowObjectReference.getWindowedValues)
(calculationResult, movingWindowObjectReference)
}
}
Can anyone offer suggestions for how to get this done space efficiently? I'm also very open to alternative solutions or comments that I'm completely off base with my proposal. It seems that the scala concept of "sliding" is roughly what I want here, but that doesn't seem to be present in Scalding.
Thanks in advance for any help!
-Dan