What is the difference between S4 and Twitter Storm projects?

yasith tharindu

Oct 10, 2011, 5:29:37 AM
to s4-pr...@googlegroups.com
Anybody have an idea about the $Subject?

--
Thanks..
Regards...


Leo Neumeyer

Oct 11, 2011, 1:48:39 PM
to s4-pr...@googlegroups.com, s4-...@incubator.apache.org
I think that both projects share similar goals. I haven't done a detailed study of Storm, so I can't give you a very detailed comparison. However, it is great to see more of these systems so we can keep learning from each other.

Let me summarize the approach we are taking in S4-piper which is the next release that will be done in the Apache Incubator. (Here for now: https://github.com/leoneu/s4-piper)

The high-level goal is unchanged: distributed stream processing and ease of use. This implies hiding the distributed nature of the application from the user.

We basically embrace symmetry, strict OO design, no multithreading required, the Prototype design pattern, extensive use of the Java platform, static typing, no string literals, no large XML configuration files.

- Symmetry: all nodes in a cluster are identical. Makes management, configuration and deployment easier. Fewer parts that can break.

- OO Design: not much I can say here except that processing elements (PEs) are objects, streams are objects, and events are objects. The core API is very small and works in a coordinated way to build the platform.

- In most cases, application developers don't need to write any multithreaded code. By default PE threads are synchronized.
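
To illustrate that point with a toy sketch (not our actual API, all names here are just illustrative), the platform can serialize delivery into a PE instance so the code inside onEvent() stays single-threaded:

class SerializedDispatch<E> {
    interface Pe<T> { void onEvent(T event); }   // stand-in for a PE instance

    private final Object lock = new Object();

    void deliver(Pe<E> pe, E event) {
        synchronized (lock) {      // the platform takes the lock, not the app
            pe.onEvent(event);     // app code can stay free of any locking
        }
    }
}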

- Prototype design pattern: once a PE is configured, we instantiate a PE object that we call a PE prototype. The PE prototype has the PE type and the configuration. PE instances are cloned from the prototype and keyed using a (key_finder, key_value) tuple. The key_finder is a function that extracts a value from the event, and that value is the key for the PE instance. This guarantees that we have a unique instance of a PE for a given key. This can be used in a single node or in a cluster with multiple nodes. For the app developer, this means that the app is a graph of prototype PEs connected by streams. The key logic resides in the stream object. Building an app is as easy as drawing a graph. We have a basic API, but people will be able to build tools on top of it (GUIs, DSLs, etc.). We are planning to provide an API on top of Guice so we can take advantage of Guice's dependency injection under the hood.
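
Here is a rough sketch of the prototype/key-finder idea (illustrative only; the class names and signatures are not the actual S4-piper API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface KeyFinder<E> {
    String get(E event);                       // extracts the key value from an event
}

abstract class ProcessingElement<E> implements Cloneable {
    abstract void onEvent(E event);            // app-specific logic

    @SuppressWarnings("unchecked")
    ProcessingElement<E> copy() {
        try {
            return (ProcessingElement<E>) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

// The "prototype" holds the configured PE and clones one instance per key value.
class PePrototype<E> {
    private final ProcessingElement<E> prototype;
    private final KeyFinder<E> keyFinder;
    private final Map<String, ProcessingElement<E>> instances = new ConcurrentHashMap<>();

    PePrototype(ProcessingElement<E> prototype, KeyFinder<E> keyFinder) {
        this.prototype = prototype;
        this.keyFinder = keyFinder;
    }

    void dispatch(E event) {
        String key = keyFinder.get(event);     // the key_value extracted by the key_finder
        instances.computeIfAbsent(key, k -> prototype.copy()).onEvent(event);
    }
}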

- Using the Java platform: We feel that having a consistent platform that is familiar to many developers helps make things easier. While dealing with some complexity and lesser-known frameworks under the hood (Generics, Guice, OSGi for dynamic app loading, ...), we try to keep the external API as pure and simple as possible. We also want to take advantage of new language features coming in Java 7, 8, ... We also have the option to easily add a Scala API.

- Static typing: I always thought that dynamic languages were much nicer, but I can see the trade-offs. In a large system with multiple programmers and modules, having static typing with a good OO design makes understanding and maintaining the code much easier. To accomplish this we use Guice to replace XML config files, which are a nightmare if you have lots of PEs to configure. I'd rather use a Java API to build the graph. We eliminate the use of string literals to find objects; this is cleaner, more efficient, easy to refactor, and we can easily find things in Eclipse. For remote objects, we send POJOs around and pretty much hide all the details. The stream object is smart enough to pass events to the comm layer when the target PE instance is in another JVM. The event magically arrives in the stream's queue at the destination.
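
As a hypothetical example of what "the app graph is just typed Java code" can look like (again, these are not the real S4 classes, just a sketch):

import java.util.function.Consumer;

final class Stream<E> implements Consumer<E> {
    private final Consumer<E> targetPe;        // downstream PE (prototype)

    Stream(Consumer<E> targetPe) { this.targetPe = targetPe; }

    @Override
    public void accept(E event) { targetPe.accept(event); }   // deliver downstream
}

public class WordCountGraph {
    public static void main(String[] args) {
        // The graph is plain, typed Java: the compiler checks the wiring and
        // the IDE can refactor it; no XML file, no string lookups for objects.
        Consumer<String> countPe = w -> System.out.println("word: " + w);
        Stream<String> words = new Stream<>(countPe);

        Consumer<String> splitPe = s -> { for (String w : s.split("\\s+")) words.accept(w); };
        Stream<String> sentences = new Stream<>(splitPe);

        sentences.accept("a small typed graph instead of a large XML file");
    }
}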

- In terms of pluggability, S4 can use any comm protocol and serialization scheme. You just need to implement it in the comm layer. Right now we have a simple UDP implementation and a TCP implementation using Netty. We tested ZMQ but saw no performance advantage and no need for it since everything is Java in our case. However, I may be missing something. Netty gives you the flexibility to implement alternate protocols.
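
A sketch of the kind of pluggability I mean, with a toy UDP emitter behind a small interface (illustrative names only, not the actual comm layer API):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

interface Emitter {
    void send(int partition, byte[] payload) throws IOException;   // one partition per node
}

class UdpEmitter implements Emitter {
    private final DatagramSocket socket;
    private final InetAddress[] nodes;          // one address per partition
    private final int port;

    UdpEmitter(InetAddress[] nodes, int port) throws IOException {
        this.socket = new DatagramSocket();
        this.nodes = nodes;
        this.port = port;
    }

    @Override
    public void send(int partition, byte[] payload) throws IOException {
        InetAddress target = nodes[partition % nodes.length];
        socket.send(new DatagramPacket(payload, payload.length, target, port));
    }
}

// Any serialization scheme can sit on top; plain Java serialization shown here.
class JavaSerializer {
    static byte[] toBytes(Serializable event) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(event);
        }
        return bos.toByteArray();
    }
}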

- In terms of losing events, it is not a matter of losing or not losing; you often need to plan how to handle peak traffic in a real-time system, otherwise it is probably over-engineered. Our plan is to provide an API for load shedding so app developers can decide what to do when the queues are getting full. Some may want to throw away events, others may want to switch to a less computationally expensive algorithm. I prefer to think in a probabilistic way, that is, event loss or load shedding is part of the algorithm and is evaluated in terms of a performance metric, as is done in a communications system.
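
For example, a load-shedding hook could look roughly like this (purely a sketch of the concept, not a committed API):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

interface OverloadPolicy<E> {
    void onOverload(E event, BlockingQueue<E> queue);   // called when the queue is full
}

// One possible policy: simply drop the incoming event (and make the loss visible).
class ShedNewestPolicy<E> implements OverloadPolicy<E> {
    @Override
    public void onOverload(E event, BlockingQueue<E> queue) {
        System.err.println("shedding event: " + event);
    }
}

class StreamQueue<E> {
    private final BlockingQueue<E> queue = new ArrayBlockingQueue<>(1024);
    private final OverloadPolicy<E> policy;

    StreamQueue(OverloadPolicy<E> policy) { this.policy = policy; }

    void put(E event) {
        if (!queue.offer(event)) {       // queue full: let the app decide what to do
            policy.onOverload(event, queue);
        }
    }
}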

- I also experimented with how to run batch tasks in S4. Unlike streaming, in batch mode we can afford infinite delays and prevent event loss. This is important for offline tasks and for debugging. For this we need to support blocking queues in the stream, so when a queue is full it blocks instead of losing events. In a cluster, this means that blocking needs to propagate upstream across nodes. The downside is that you may end up with deadlocks if the graph has cycles, but that is a whole other story. In the model example I push events at a high data rate into S4 to train a model. The exact same model is also used at run time. This brings consistency between training and execution because it uses the same code base and the same framework. It doesn't have all the functionality of Hadoop but can be used for some tasks.
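
The difference between the two enqueue behaviors is basically this (toy sketch, not S4 code):

import java.util.concurrent.BlockingQueue;

class Enqueue {
    static <E> boolean streaming(BlockingQueue<E> q, E event) {
        // Real-time mode: never block the producer; a full queue means shedding.
        return q.offer(event);
    }

    static <E> void batch(BlockingQueue<E> q, E event) throws InterruptedException {
        // Batch mode: block until space frees up, so no event is ever lost.
        // In a cluster this back-pressure has to propagate upstream across nodes.
        q.put(event);
    }
}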

Hope this helps for now. 

-leo

AliKevin

Oct 20, 2011, 2:17:32 AM
to s4-project
As far as I know, there are some differences between S4 and Storm. Specifically:
1. Storm deploys a new application via Apache Thrift and supports hot deployment, but S4 currently deploys a new application through a Spring configuration file.
2. One detailed difference is that Storm supports defining a bolt's instance count through configuration, but S4 automatically creates N instances, where N is the number of nodes in the S4 cluster.
3. S4 supplies an adapter API and a driver API, so we can develop client apps the S4 way, but Storm doesn't supply a client API.
That's all I know. I got this information by reading the docs, so if there are any mistakes, please tell me :)
Best regards!

gaurav parashar

Jan 15, 2015, 12:25:22 AM
to s4-pr...@googlegroups.com, s4-...@incubator.apache.org
I would like to know whether load shedding is available in other real-time systems like Storm, Spark, or Kafka.
If not, why haven't they implemented it in these systems?
Regards