Real-time or historical streams: Process real-time streams of events, or replay historical streams using virtualized time.
Progressive processing: A job is a progressive set of stages. Each stage performs a series of transformations on the event stream. Stages are the basic unit of scheduling and fault tolerance, and are composed together to form a job.
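A minimal sketch of the "stages compose into a job" idea, using only the standard library (the framework's real stage API is not shown here; `Stage`, `compose`, and `runExample` are illustrative names, and `java.util.stream` stands in for the actual streaming runtime):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: a stage transforms one event stream into another,
// and a job is simply the composition of its stages.
public class StagedJob {
    // Model a stage as a function from one stream to another.
    public interface Stage<I, O> extends Function<Stream<I>, Stream<O>> {}

    // Compose two stages into a larger stage (and ultimately a job).
    public static <A, B, C> Stage<A, C> compose(Stage<A, B> first, Stage<B, C> second) {
        return in -> second.apply(first.apply(in));
    }

    // Example two-stage job: parse strings, then square the numbers.
    public static List<Integer> runExample(List<String> input) {
        Stage<String, Integer> parse = s -> s.map(Integer::parseInt);  // stage 1
        Stage<Integer, Integer> square = s -> s.map(x -> x * x);       // stage 2
        Stage<String, Integer> job = compose(parse, square);           // the job
        return job.apply(input.stream()).collect(Collectors.toList());
    }
}
```

Because each stage is an ordinary value, the scheduler can place, restart, or scale a stage independently of the others, which is why stages make a natural unit for scheduling and fault tolerance.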
Push, pull, or mixed: Support push- and pull-based event streams. A single job can be configured with a mix of push and pull across its progressive stages. Backpressure must propagate across all stages of execution, with explicit strategies for hot push data sources that cannot be slowed down.
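One common strategy for a hot push source is to buffer into a bounded queue and shed load when the consumer falls behind, since the producer cannot be slowed. A minimal sketch of that drop-on-overflow strategy (class and method names are illustrative, not the framework's API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: bridge a hot push source to a pull-based stage.
// When the bounded buffer is full, drop the incoming event instead of
// blocking the producer; backpressure is absorbed by shedding load.
public class DropOnBackpressure<T> {
    private final BlockingQueue<T> buffer;
    private long dropped = 0;

    public DropOnBackpressure(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Called by the push source; never blocks.
    public void push(T event) {
        if (!buffer.offer(event)) {
            dropped++; // queue full: shed load rather than slow the source
        }
    }

    // Called by the pull-based downstream stage; null if empty.
    public T pull() {
        return buffer.poll();
    }

    public long droppedCount() { return dropped; }
}
```

Other strategies (sampling, conflation, spilling to disk) fit the same shape; the key property is that the `push` path never blocks.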
Higher-order functions: The API uses higher-order functions to express job behavior. The Reactive Extensions (Rx) library provides composable operators for processing asynchronous event streams.
Polyglot support: The core is written in Java, but jobs can be expressed in any language that runs on the JVM.
Cloud native: Dynamically scale both jobs and the cluster. Job scaling is a function of the event stream, while cluster scaling is a function of the jobs scheduled.
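To make "job scaling is a function of the event stream" concrete, here is a minimal sketch where the desired worker count is computed purely from the observed event rate against a target per-worker rate (the names, and the idea of a fixed target rate, are assumptions for illustration, not the framework's actual autoscaling policy):

```java
// Hypothetical sketch: desired worker count as a pure function of the
// aggregate event rate, clamped to configured min/max bounds.
public class JobAutoscaler {
    private final double targetEventsPerWorker;
    private final int minWorkers;
    private final int maxWorkers;

    public JobAutoscaler(double targetEventsPerWorker, int minWorkers, int maxWorkers) {
        this.targetEventsPerWorker = targetEventsPerWorker;
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
    }

    // How many workers this job wants for the current event rate.
    public int desiredWorkers(double eventsPerSecond) {
        int wanted = (int) Math.ceil(eventsPerSecond / targetEventsPerWorker);
        return Math.max(minWorkers, Math.min(maxWorkers, wanted));
    }
}
```

Cluster scaling then becomes the second-order function: summing the desired workers across all scheduled jobs yields the capacity the cluster itself should scale toward.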
Fault-tolerant scheduling: Provide a fault-tolerant scheduling service. All running jobs, and jobs scheduled for execution, are resilient to master and worker failures.
Job isolation: Each job runs in isolation from other jobs. Resource isolation covers CPU, memory, disk, and network I/O.
Stream locality: The scheduler favors placing jobs that share a common data source on the same physical machine. This limits duplicate copies of the same data flowing into the cluster.
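A minimal sketch of a locality-aware placement decision: reuse the machine that already serves a given source, and fall back to round-robin for new sources. All names are illustrative, and real schedulers would also weigh machine load and isolation constraints:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: prefer a machine that already hosts a job reading
// the same source, so the source's data enters the cluster only once.
public class LocalityScheduler {
    private final Map<String, String> sourceToMachine = new HashMap<>();
    private final List<String> machines;
    private int next = 0;

    public LocalityScheduler(List<String> machines) {
        this.machines = machines;
    }

    // Choose a machine for a job consuming the given source.
    public String place(String sourceId) {
        String machine = sourceToMachine.get(sourceId);
        if (machine == null) {
            // New source: fall back to round-robin and remember the choice.
            machine = machines.get(next++ % machines.size());
            sourceToMachine.put(sourceId, machine);
        }
        return machine;
    }
}
```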
User interface: Provide a rich user interface for job creators and cluster administrators. Job creators can create, submit, kill, or get insights into a running job. Administrators can get insights into running jobs, cluster health, and more.
Reusable sources and sinks: Job ‘source’ (input) and job ‘sink’ (output) implementations can be reused and extended. The framework provides low-level source and sink implementations such as HTTP and Kafka.
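The reusability comes from every job sharing the same narrow source/sink contracts. A minimal sketch of what such contracts might look like (these interface and method names are assumptions for illustration; the framework's real interfaces may differ):

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of reusable source/sink contracts: a source yields
// events, a sink consumes them, and a job drains one into the other.
public class SourcesAndSinks {
    public interface Source<T> { Iterator<T> open(); }
    public interface Sink<T> extends Consumer<T> {}

    // A trivial in-memory source, reusable across jobs (e.g., for tests).
    public static <T> Source<T> fromList(List<T> items) {
        return items::iterator;
    }

    // The shape every job shares: drain a source into a sink.
    public static <T> void run(Source<T> source, Sink<T> sink) {
        source.open().forEachRemaining(sink);
    }
}
```

An HTTP or Kafka source would implement the same `Source` contract, which is what lets jobs swap inputs and outputs without changing their stages.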
Lightweight job operations: Job operations such as submit and kill require no explicit changes to the cluster infrastructure. The design supports a ‘REPL-style’ approach to job creation and execution.