tl;dr: Maybe a good place to start would be Cascading for Storm?
This question was asked the other day on the Cascading group as well:
http://groups.google.com/group/cascading-user/browse_thread/thread/8f80ca8a81cd6038
First, I am a Cascading novice, and of course a Storm novice, so I may
be way off base in my thinking, but I think it is definitely worth a
discussion, as it feels like these higher level abstractions are
imminent, on the tip of the tongue, ready to be teased out.
Cascading itself might be a good model or starting point as it acts as
a kind of compiler for stream transformations where the assembly
language or AST is an object graph of data processing operations.
As you are all probably aware, in the edge version of Cascading,
Hadoop has been completely decoupled from the core (see the link
above), so it seems ready to be adapted to other stream processing
platforms such as Storm. Of course, some of the current (and central)
building blocks in Cascading, such as joins, sorting, and maybe
buffers as they currently exist, just aren't appropriate for real-time
use, as has been discussed.
But Cascading also has sub-assemblies, which allow lower level pipe
components to be composed. They could serve as a form of polymorphism:
a higher level outcome or algorithm is described and configured
declaratively, but implemented behind the scenes as a pair of
sub-assemblies, one composed of batch primitives and the other of
stream primitives (which obviously don't exist in Cascading yet). The
batch implementation may be an exact version of an algorithm, and the
stream version an approximation, as appropriate.
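
To make the exact/approximate pairing concrete, here is a minimal
sketch in plain Python. It is not Cascading or Storm code; the names
distinct_count, exact_distinct and ApproxDistinct are made up for
illustration, and linear counting is just one approximation I picked
because it is short. The point is only that one declared operation can
be backed by two different implementations depending on the target.

```python
import hashlib
import math


def exact_distinct(records):
    """Batch sub-assembly: sees the whole data set, returns the exact count."""
    return len(set(records))


class ApproxDistinct:
    """Stream sub-assembly: linear counting over a fixed-size bitmap."""

    def __init__(self, num_bits=8192):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def add(self, record):
        # Hash the record to a slot and mark it; memory use never grows.
        digest = hashlib.sha256(str(record).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[slot] = 1

    def estimate(self):
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            return float("inf")  # bitmap saturated; num_bits is too small
        return -self.num_bits * math.log(zeros / self.num_bits)


def distinct_count(records, target):
    """Hypothetical high-level operation: one definition, two implementations."""
    if target == "batch":
        return exact_distinct(records)
    if target == "stream":
        sketch = ApproxDistinct()
        for record in records:  # one tuple at a time, bounded memory
            sketch.add(record)
        return sketch.estimate()
    raise ValueError(f"unknown target: {target}")
```

Calling distinct_count(data, "batch") gives the exact answer, while
distinct_count(data, "stream") gives an estimate that can be read at
any point while tuples are still arriving.
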
Above Cascading (as it currently exists anyway), a new kind of
"Planner" could interpret the higher level system definition and
emit equivalent stream or batch sub-assemblies based on the specified
target. These would then be further interpreted by an appropriate
platform-specific Cascading planner to create parallel map/reduce jobs
or topologies. But that would be down the road a bit.
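
A toy sketch of that two-stage planning, again in plain Python and
with nothing taken from the real Cascading planner. The catalogue and
every primitive name in it (group_by, hash_to_bitmap, and so on) are
hypothetical placeholders; a real planner would emit pipes, not
strings.

```python
# Hypothetical catalogue: high-level operation -> sub-assembly per target.
SUB_ASSEMBLIES = {
    "distinct_count": {
        "batch": ["group_by", "count_groups"],
        "stream": ["hash_to_bitmap", "estimate_from_bitmap"],
    },
}


def plan_sub_assemblies(definition, target):
    """Stage 1: expand each high-level operation for the chosen target."""
    steps = []
    for operation in definition:
        try:
            steps.extend(SUB_ASSEMBLIES[operation][target])
        except KeyError:
            raise ValueError(f"no {target} sub-assembly for {operation}")
    return steps


def plan_platform(steps, target):
    """Stage 2: stand-in for the platform-specific planner."""
    if target == "batch":
        return {"kind": "mapreduce", "jobs": steps}
    return {"kind": "topology", "spout": "source", "bolts": steps}


def compile_definition(definition, target):
    """Run both stages: system definition -> sub-assemblies -> platform plan."""
    return plan_platform(plan_sub_assemblies(definition, target), target)
```

The same definition, ["distinct_count"], compiles to a list of jobs
for the batch target and to a spout plus a chain of bolts for the
stream target.
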
I don't think it is necessary to envision a general purpose,
one-size-fits-all higher level abstraction, which might be very
difficult to achieve via thought experiment. I think a good first step
might be to start work on Cascading, or Cascading-style, real-time
stream processing primitives that can be composed and used to automate
the construction of topologies in the same way Cascading currently does
for batch jobs. There are a lot of differences, but also a lot of
overlap.
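
As a rough picture of what composable stream primitives might look
like, here is a sketch using Python generators. It runs in a single
process and has nothing to do with the actual Cascading or Storm APIs;
each, keep, running_count and assemble are names I invented. Mapping
each stage onto a bolt would be the planner's job.

```python
from collections import Counter


def each(fn):
    """Primitive: apply fn to every tuple as it arrives."""
    def stage(stream):
        for item in stream:
            yield fn(item)
    return stage


def keep(predicate):
    """Primitive: pass through only the tuples that satisfy predicate."""
    def stage(stream):
        for item in stream:
            if predicate(item):
                yield item
    return stage


def running_count():
    """Primitive: emit (key, count so far) after every tuple.

    A stand-in for a batch group-by count, which cannot be used as-is
    because a stream never reaches the end of its input.
    """
    def stage(stream):
        counts = Counter()
        for key in stream:
            counts[key] += 1
            yield key, counts[key]
    return stage


def assemble(*stages):
    """Compose primitives into one reusable sub-assembly."""
    def pipeline(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return pipeline
```

For example, assemble(each(str.lower), keep(lambda w: len(w) > 3),
running_count()) gives a pipeline that can be applied to any iterable
of words and yields updated counts one tuple at a time.
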
This would be useful just on its own (at least in my imagination), but
it would also provide a basis for physically experimenting with higher
level abstractions that can be compiled automatically to either
paradigm. Over time, the higher level abstraction(s) applicable to a
wide range of applications might naturally emerge.