Two basic questions -- about spout and tasks

1,012 views
Skip to first unread message

Lin Ma

unread,
Nov 4, 2012, 10:29:40 AM11/4/12
to storm...@googlegroups.com
Hi guys,

I am a new Storm user and learning this tutorial => https://github.com/nathanmarz/storm/wiki/Tutorial, two basic questions.

1. Supposing I have a spout, and the spout emits two different types of output (for different types I mean format, meaning, schema are different), and one type of bolt A consumes first type of output of the spout, and the other type of bolt B consumes the other type of output of the spout. How to write such topology, so that the right type of output is sent to right type of bolt? BTW: a stupid implementation is, send all types of output from spout to all types of bolt, and let bolt filter out unnecessary traffic, which is not optimum since additional traffic/filtering overhead;
2. The term "task" or "tasks" are mentioned in the twiki. I think it is refers to specific thing a bolt or a spout is doing in a topology. Is that correct?

thanks in advance,
Lin

Anuj Kumar

unread,
Nov 4, 2012, 12:45:08 PM11/4/12
to storm...@googlegroups.com
On Sun, Nov 4, 2012 at 8:59 PM, Lin Ma <lin...@gmail.com> wrote:
Hi guys,

I am a new Storm user and learning this tutorial => https://github.com/nathanmarz/storm/wiki/Tutorial, two basic questions.

1. Supposing I have a spout, and the spout emits two different types of output (for different types I mean format, meaning, schema are different), and one type of bolt A consumes first type of output of the spout, and the other type of bolt B consumes the other type of output of the spout. How to write such topology, so that the right type of output is sent to right type of bolt?

Use different Stream IDs and emit the tuple from the spout to the right streamID. Also, make sure that your bolt subscribe to one or more stream IDs on which the spout is emitting the tuples. That way you can direct the tuples to only the bolts that are subscribed to the stream ID on which the tuples are being emitted by the Spout. Go through Data Model- https://github.com/nathanmarz/storm/wiki/Tutorial

 
BTW: a stupid implementation is, send all types of output from spout to all types of bolt, and let bolt filter out unnecessary traffic, which is not optimum since additional traffic/filtering overhead;

Sending all the tuples to all the bolts is not advisable. You should choose the right stream IDs and accordingly structure your topology. However, you may have a requirement where your bolt may be listening to more than one stream IDs. In that case, you can use the method tuple.getSourceStreamId() in the execute method of your bolt to determine the stream ID of the tuple and take the action accordingly. See- http://nathanmarz.github.com/storm/doc/backtype/storm/tuple/Tuple.html
 
2. The term "task" or "tasks" are mentioned in the twiki. I think it is refers to specific thing a bolt or a spout is doing in a topology. Is that correct?

Task corresponds to actual data processing done by the bolt. Michael has provided very good explanation of the components including a Task (latest release)- http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
 

thanks in advance,
Lin

You can also start taking a look at new high-level abstraction called Trident- https://github.com/nathanmarz/storm/wiki/Trident-tutorial

- Anuj

Lin Ma

unread,
Nov 5, 2012, 8:04:12 AM11/5/12
to storm...@googlegroups.com
Thanks Anuj, you answered all of my questions except the one regarding to spout sending multiple streams. I am new to how to write a spout with multiple streams. Is there an example or any pseudo code, which shows (1) how to write a spout with multiple streams, and (2) how to write a bolt which subscribe to the spout in the topology, but only receives interested stream?

BTW: supposing I have a Spout S1 which emits two streams, Bolt A subscribes to one of the streams, and Bolt B subscribes to the other stream. Is it one topology (S1 + A + B), or two topologies (S1 + A, and S1 + B)?

regards,
Lin

Anuj Kumar

unread,
Nov 5, 2012, 12:14:31 PM11/5/12
to storm...@googlegroups.com
On Mon, Nov 5, 2012 at 6:34 PM, Lin Ma <lin...@gmail.com> wrote:
Thanks Anuj, you answered all of my questions except the one regarding to spout sending multiple streams. I am new to how to write a spout with multiple streams. Is there an example or any pseudo code, which shows (1) how to write a spout with multiple streams, and (2) how to write a bolt which subscribe to the spout in the topology, but only receives interested stream?

BTW: supposing I have a Spout S1 which emits two streams, Bolt A subscribes to one of the streams, and Bolt B subscribes to the other stream. Is it one topology (S1 + A + B), or two topologies (S1 + A, and S1 + B)?

You need only one topology. I would recommend going over the tutorial wiki page created by Nathan- https://github.com/nathanmarz/storm/wiki/Tutorial and some of the concepts- https://github.com/nathanmarz/storm/wiki/Concepts

Here is a simple approach that you can adopt-

Spout Implementation
-----------------------------------
In the declareOutputFields method of Spout implementation, declare two streams that the spout can emit -

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declareStream("StreamA", new Fields("fieldA"));
      declarer.declareStream("StreamB", new Fields("fieldB"));
}

Now in the nextTuple method of the Spout you can emit tuples on both of these streams-

@Override
public void nextTuple() {
      // your logic
      // emit
      collector.emit("StreamA", new Values("Tuple for Bolt A"));
      collector.emit("StreamB", new Values("Tuple for Bolt B"));
}

In your topology class just link the bolts implementation (for example, BoltA and BoltB) with the Spout-

public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException, InterruptedException{
// create a topology builder
TopologyBuilder builder = new TopologyBuilder();
// set the spout
builder.setSpout("S1", new SpoutS1(), 1);
        // add the bolts
        builder.setBolt("A", new BoltA(), 1).shuffleGrouping("S1", "StreamA");
        // add a new bolt that connects to same Spout ID but different stream ID
        builder.setBolt("B", new BoltB(), 1).shuffleGrouping("S1", "StreamB");
        // rest of the configuration
        ...
}
 
You can use different grouping schemes. For details, see- Streams Grouping topic here- https://github.com/nathanmarz/storm/wiki/Concepts

I haven't run this code but I hope that you will be able to understand the overall idea here. There are a number of implementations in the storm-contrib and storm-starter projects that you can go through- 

- Anuj

Lin Ma

unread,
Nov 5, 2012, 8:06:47 PM11/5/12
to storm...@googlegroups.com
Thanks Anuj, my question is answered. Have a good day.

regards,
Lin

tcardo

unread,
Nov 7, 2012, 10:05:06 AM11/7/12
to storm...@googlegroups.com
Hi,

I use this topic to ask a slightly different question :
I have one spout and two bolt. Each bolt consumes the same stream (the spout) one bolt is very fast (bolt A), the other one very slow (bolt B). How storm manage this difference? 
From my tests, it seems that Bolt "A" process all its tuples and then wait for Bolt "B" and so on... Is it correct?
Anyone could explain to me how storm is working in this specific case 

Thanks a lot!

tcardo

unread,
Nov 9, 2012, 1:53:30 AM11/9/12
to storm...@googlegroups.com
anybody can help me...?

Max Ivanov

unread,
Nov 9, 2012, 12:25:02 PM11/9/12
to storm...@googlegroups.com
Search for "throttle", there is a nice explanation from nathanmarz  from 8th of November.

On Friday, 9 November 2012 06:53:30 UTC, tcardo wrote:
anybody can help me...?

Reply all
Reply to author
Forward
0 new messages