RxJava and distributed streams

Todd Nine

Feb 20, 2015, 3:06:17 AM
to rxj...@googlegroups.com
Hey guys,
  We've been using RxJava very successfully, and we're incredibly happy with it.  Fantastic work!  Now we've hit a wall and need to expand: we're at the point where we need to perform distributed tree processing through our functional pipeline.  On a single node, it can easily be expressed like this:


Observable.just(rootNode).flatMap(applications).flatMap(collections).flatMap(collectionShards).map(doSomeWork).subscribe();

We emit items as we process our tree.  Rather than a local flatMap, we'd like to distribute this work across our cluster.  Has anyone done this with Rx, and what framework did you use?  I know there's RxNetty floating around, but I wasn't sure how stable it is, or whether there's a better solution.
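
For reference, here's a minimal single-node sketch of that pipeline in RxJava 1.x.  The domain types and traversal helpers (Node, applicationsOf, collectionsOf, shardsOf, doSomeWork) are hypothetical stand-ins, not our real Usergrid types:

import rx.Observable;

public class TreePipeline {

    static class Node {
        final String id;
        Node(String id) { this.id = id; }
    }

    // Each helper expands one level of the tree into a stream of children
    // (hardcoded here; a real implementation would query the data store).
    static Observable<Node> applicationsOf(Node root) {
        return Observable.just(new Node(root.id + "/app1"), new Node(root.id + "/app2"));
    }

    static Observable<Node> collectionsOf(Node app) {
        return Observable.just(new Node(app.id + "/users"), new Node(app.id + "/events"));
    }

    static Observable<Node> shardsOf(Node collection) {
        return Observable.just(new Node(collection.id + "/shard0"), new Node(collection.id + "/shard1"));
    }

    static String doSomeWork(Node shard) {
        return "processed " + shard.id;   // stand-in for the real per-shard work
    }

    public static void main(String[] args) {
        Observable.just(new Node("root"))
            .flatMap(TreePipeline::applicationsOf)   // fan out to applications
            .flatMap(TreePipeline::collectionsOf)    // fan out to collections
            .flatMap(TreePipeline::shardsOf)         // fan out to shards
            .map(TreePipeline::doSomeWork)           // process each shard
            .subscribe(System.out::println);
    }
}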

Thanks!
Todd

Ben Christensen

Feb 26, 2015, 1:48:42 PM
to Todd Nine, rxj...@googlegroups.com
We have built a system like this at Netflix, but aren’t quite ready to open source it. It is a distributed, Rx stream processing system built as an application framework on top of Mesos. 

RxNetty is the networking library we use for it. It is production-worthy, but the APIs and functionality are not yet stable. We are already discussing v2 of the RxNetty work, and it will almost certainly be called something else and live elsewhere as it outgrows what we've started with RxNetty.
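
For a sense of what the current API surface looks like, a minimal HTTP server on the 0.4.x line looks roughly like this (a sketch only; the class name is arbitrary and, again, this API will change):

import io.reactivex.netty.RxNetty;

public class HelloRxNetty {
    public static void main(String[] args) {
        // Handler receives a request/response pair and returns an
        // Observable<Void> that completes when the response is written.
        RxNetty.createHttpServer(8080,
                (request, response) -> response.writeStringAndFlush("Hello from RxNetty"))
            .startAndWait();   // blocks until the server is shut down
    }
}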

Related to this is a post I just did here: https://github.com/reactive-streams/reactive-streams-io/issues/1

-- 
Ben Christensen
+1.310.782.5511  @benjchristensen

Todd Nine

Mar 2, 2015, 8:56:08 PM
to rxj...@googlegroups.com, tn...@apigee.com
Thanks for the response, Ben.  Out of curiosity, why did you ultimately choose to implement your own mechanism?

We're doing basic evaluations of Akka and the Vert.x event bus as our distribution mechanism.  Vert.x seems out because it lacks the state you describe.  Akka claims to support stateful actor trees, but I haven't tried it yet.  Did you simply not like the actor model, or was there a functional limitation you needed to overcome?  The semantics you describe in that ticket are exactly the kind of functionality we need in Usergrid to make it a fully distributed event database.  I'd love to hear your thoughts.


Thanks,
Todd

Ben Christensen

Mar 6, 2015, 3:12:07 PM
to Todd Nine, rxj...@googlegroups.com, tn...@apigee.com
Here are things we considered:

  • Real-time or historical streams:  Process real-time streams of events, or replay historical streams using virtualized time.

  • Progressive processing:  A job is a progressive set of stages.  Each stage performs a series of transformations on the event stream.  Stages are the basic unit of scheduling and fault tolerance, and are composed together to form a job.

  • Push, pull, or mixed:  Support both push and pull approaches to event streams.  A single job can be configured to mix push and pull for each progressive job stage.  Backpressure must propagate across all stages of execution, with strategies for hot push data sources (see the second sketch after this list).

  • Higher-order functions:  The API uses higher-order functions to express job behavior.  The Reactive Extensions (Rx) library is used to compose processing of asynchronous streams (see the first sketch after this list).

  • Polyglot support:  The core is written in Java, but jobs can be expressed in any language that runs on the JVM.

  • Cloud native:  Dynamically scale both jobs and the cluster, where job scaling is a function of the event stream and cluster scaling is a function of the jobs scheduled.

  • Fault-tolerant scheduling:  Provide a fault-tolerant scheduling service.  All running jobs, and jobs scheduled to be executed, are resilient to master and worker failures.

  • Job isolation:  Each job runs in isolation from other jobs.  Resource isolation covers CPU, memory, disk, and network I/O.

  • Stream locality:  The scheduler favors placing jobs that share a common data source on the same physical machine.  This limits the amount of data duplicated into the cluster.

  • User interface:  Provide a rich user interface for job creators and cluster administrators.  Job creators can create, submit, kill, or get insights into a running job.  Administrators can get insights into running jobs, cluster health, etc.

  • Reusable sources and sinks:  Job ‘source’ (input) and ‘sink’ (output) implementations can be reused and extended.  The framework provides some low-level source and sink implementations, such as HTTP and Kafka.

  • Lightweight job operations:  Job operations such as submit and kill require no explicit changes to the cluster infrastructure.  Designed to support a ‘REPL style’ approach to job creation and execution.

  • Artifact specification:  The specification provides a clean way for jobs to define and package dependent libraries.  It also provides metadata to the user interface to expose job details such as version, description, dependencies, etc.

  • Parameterized jobs:  Allow a job to be defined and made accessible via a URI, which can be subscribed to with parameters that deploy the job on demand with those runtime parameters.  This lets users reuse common stream-processing behavior with different arguments.

  • Multi-tenant:  Related to cloud native, but a key design goal in itself: permit many jobs to run on the same cluster, scheduled and auto-scaled individually across the same base infrastructure.
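
To make the higher-order-functions point concrete, here is a minimal sketch (RxJava 1.x, and emphatically not our actual framework; Stage, filterEven, and enrich are hypothetical names) of a job expressed as composed stream transformations:

import rx.Observable;
import rx.schedulers.Schedulers;

public class StagedJob {

    // A stage is just a function from one event stream to another.
    interface Stage<T, R> extends Observable.Transformer<T, R> { }

    static Stage<Integer, Integer> filterEven() {
        return in -> in.filter(i -> i % 2 == 0);
    }

    static Stage<Integer, String> enrich() {
        return in -> in.observeOn(Schedulers.computation())   // async boundary
                       .map(i -> "event-" + i);
    }

    public static void main(String[] args) {
        Observable.range(1, 10)          // stand-in source
            .compose(filterEven())       // stage 1: transform the stream
            .compose(enrich())           // stage 2: further transformation
            .toBlocking()                // keep the demo alive until done
            .forEach(System.out::println);
    }
}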

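And for the backpressure point, a small sketch (again RxJava 1.x, hypothetical names) of one strategy for a push source that ignores downstream demand: choose an explicit policy, such as dropping, before crossing the async boundary where pull-based demand applies.

import java.util.concurrent.TimeUnit;
import rx.Observable;
import rx.schedulers.Schedulers;

public class HotSourceBackpressure {
    public static void main(String[] args) {
        Observable.interval(1, TimeUnit.MILLISECONDS)   // timer-driven push source
            .onBackpressureDrop()                       // strategy: shed excess load
            .observeOn(Schedulers.computation())        // pull-based async boundary
            .map(HotSourceBackpressure::slowWork)
            .take(5)
            .toBlocking()
            .forEach(System.out::println);
    }

    static String slowWork(long v) {
        try { Thread.sleep(50); } catch (InterruptedException e) { }  // simulate slow work
        return "processed " + v;
    }
}
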
-- 
Ben Christensen
+1.310.782.5511  @benjchristensen
