Terminating a topology after processing a finite stream of input


Sriram

Feb 15, 2012, 2:05:34 AM
to storm-user
I am evaluating Storm for my project. As I understand it, a topology
processes messages forever. But I have a requirement in which there is
a finite stream of input, and the final-stage bolt must write its
output once all the tuples in the system have finished processing.
For example, imagine a finite stream of sentences: the task is to
persist the word counts to disk once processing of all the tuples has
completed.

How do we know when all tuples have finished processing? I was
thinking of having another bolt keep track of the count for every
stage (bolts at each stage transmit their counts) and, when the counts
match for the last stage, send a message to commit. Is there a better
way to do it? In general, performing a MapReduce-style operation on a
finite stream will require this.
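
A minimal sketch of this counting idea, assuming each processing bolt periodically reports a (stage, sentCount, processedCount) triple on a dedicated stats stream; every class, stream, and field name here is illustrative, and a real version would also have to account for replays and counts still in flight:

import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Illustrative tracker: declares completion when every reporting stage has
// processed as many tuples as it was sent.
public class CompletionTrackerBolt extends BaseRichBolt {
    private OutputCollector collector;
    private final Map<String, Long> sent = new HashMap<String, Long>();
    private final Map<String, Long> processed = new HashMap<String, Long>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Expects (stage, sentCount, processedCount) from each stage's bolts.
        sent.put(input.getString(0), input.getLong(1));
        processed.put(input.getString(0), input.getLong(2));
        if (allStagesDrained()) {
            // Signal the final-stage bolt that it is safe to persist results.
            collector.emit("commit", new Values("commit"));
        }
        collector.ack(input);
    }

    private boolean allStagesDrained() {
        for (Map.Entry<String, Long> e : sent.entrySet()) {
            Long done = processed.get(e.getKey());
            if (done == null || done.longValue() < e.getValue().longValue()) {
                return false;
            }
        }
        return !sent.isEmpty();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("commit", new Fields("signal"));
    }
}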

Thanks,
Sriram

Ken.Zhang

Feb 15, 2012, 2:46:46 AM
to storm...@googlegroups.com

Sounds like batch processing; I think something like Hadoop may meet your requirements better.


nathanmarz

Feb 15, 2012, 2:52:28 AM
to storm-user
This is essentially what DRPC and transactional topologies provide.
They use CoordinatedBolt to determine how many tuples were sent where,
and when a task has received all the tuples for a particular batch.

If you're just dealing with a large amount of finite input, a system
like Hadoop is a better choice because it provides fine-grained fault
tolerance within the computation of a single job.

If you just want to process "small" batches of messages (where you
don't mind having to restart the entire computation on the batch from
scratch), then Storm works fine. You should be able to use
transactional topologies for this: https://github.com/nathanmarz/storm/wiki/Transactional-topologies
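
As a hedged illustration of the pattern those topologies provide (a sketch, not code from the docs): a bolt that implements ICommitter has its finishBatch() called only after every tuple in the batch has been processed, which is a natural place to persist the word counts from the original question. The file path and the assumed (attempt, word, partialCount) tuple layout are placeholders:

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseTransactionalBolt;
import backtype.storm.transactional.ICommitter;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.tuple.Tuple;

// Sketch of a final-stage committer. Because it implements ICommitter,
// finishBatch() runs only after the whole batch has been processed.
public class PersistWordCounts extends BaseTransactionalBolt implements ICommitter {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void prepare(Map conf, TopologyContext context,
                        BatchOutputCollector collector, TransactionAttempt id) {
        counts.clear();
    }

    @Override
    public void execute(Tuple tuple) {
        // Assumes upstream emits (attempt, word, partialCount) for this batch.
        String word = tuple.getString(1);
        long partial = tuple.getLong(2);
        Long current = counts.get(word);
        counts.put(word, current == null ? partial : current + partial);
    }

    @Override
    public void finishBatch() {
        // The whole batch is done; persist the word counts to disk.
        try {
            FileWriter out = new FileWriter("/tmp/word-counts.txt", true);
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue() + "\n");
            }
            out.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt; nothing to declare.
    }
}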

JWellington

Aug 29, 2012, 3:12:13 PM
to storm...@googlegroups.com
Hey there,
I have a general disdain for the Hadoop framework--don't get me started on the InputFormat.
With the introduction of Trident, is it possible to signal the termination of a data source from a spout?
I would like this auto-termination feature. There are a lot of streams out there that are not everlasting.
Thanks!

Nathan Marz

Aug 29, 2012, 3:15:23 PM
to storm...@googlegroups.com
From the spout coordinator, you could easily have it take some action when there's nothing left in the stream (such as connecting to Nimbus and terminating the topology).
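
A hedged sketch of what that could look like, assuming the coordinator of your Trident spout can detect that the backing source is exhausted; the topology name and wait time are placeholders, while the Nimbus calls are Storm's standard Thrift client API:

import java.util.Map;
import backtype.storm.generated.KillOptions;
import backtype.storm.generated.Nimbus;
import backtype.storm.utils.NimbusClient;

public class TopologyTerminator {
    // Call from the coordinator once there is nothing left in the stream.
    public static void terminate(Map stormConf, String topologyName) throws Exception {
        Nimbus.Client nimbus = NimbusClient.getConfiguredClient(stormConf).getClient();
        KillOptions opts = new KillOptions();
        opts.set_wait_secs(10); // give in-flight batches time to finish
        nimbus.killTopologyWithOpts(topologyName, opts);
    }
}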
--
Twitter: @nathanmarz
http://nathanmarz.com

Juan Moreno

Aug 29, 2012, 3:18:27 PM
to storm...@googlegroups.com
This looks like a good idea, Nathan, thanks.
That still seems like a workaround to me, though ;-)
I would like to request an end-of-stream feature =)

Just to make sure I understand what you are suggesting: is the spout coordinator something available in regular streaming topologies as well?
--
Juan Wellington Moreno
Software Engineer
Potomac Fusion
7230 Lee DeForest Drive, Suite 100
Columbia, MD 21046
Work Direct: (410) 794-9017

Nathan Marz

Aug 29, 2012, 3:20:59 PM
to storm...@googlegroups.com
The coordinator is only a Trident thing.

In a regular topology, you can use some other strategy to coordinate the end of the stream. Perhaps you synchronize through ZK or something.
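
For example (a sketch, assuming plain ZooKeeper and an illustrative znode path): the spout creates a marker znode once its source is exhausted, and the final bolt checks or watches for that marker before persisting its results. Real code would also need to wait for in-flight tuples to be acked and to create parent znodes:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EndOfStreamMarker {
    private static final String PATH = "/myapp/end-of-stream"; // illustrative

    // Spout side: call once the source has no more input.
    public static void markDone(ZooKeeper zk) throws Exception {
        try {
            zk.create(PATH, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException alreadyMarked) {
            // Another spout task got there first; that's fine.
        }
    }

    // Bolt side: poll (or set a watch) before committing results.
    public static boolean isDone(ZooKeeper zk) throws Exception {
        return zk.exists(PATH, false) != null;
    }
}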

This isn't going to be a Storm feature, since you can so easily implement it as part of a Trident spout, and there is nothing special about doing this that requires it to be part of the core.

Vaibhav Puranik

Aug 30, 2012, 12:18:01 PM
to storm...@googlegroups.com
Nathan,

It would be nice to make this (shutting down a topology when its stream ends) easier or more obvious, so that people can start using Storm for batch processing too.
I think some discussion in the common patterns document might be sufficient for now.

Regards,
Vaibhav
GumGum