Topology Deployment Model


P. Taylor Goetz

Nov 17, 2011, 5:40:42 PM
to storm...@googlegroups.com
Hey Nathan,

I'm working on the storm-maven-plugin and have hit a wall related to how topologies are submitted to clusters (local and remote).

The usage scenarios for the plugin can be found here: https://github.com/ptgoetz/storm-maven-plugin

Currently the convention is to define a class (which I'll call a "driver") with a main() method that builds the topology and then decides whether to deploy to a local cluster or a remote cluster.
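
For example, a typical driver under that convention looks something like this (just a sketch -- the class name, component setup, and topology names are placeholders):

public class MyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... set spouts and bolts on the builder ...
        Config conf = new Config();

        if (args.length > 0) {
            // an argument was given: submit to the remote cluster under that name
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // no arguments: run in an in-process LocalCluster for development/testing
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", conf, builder.createTopology());
        }
    }
}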

For the maven plugin, I was hoping to have separate goals for running locally ("mvn storm:run") or submitting to a cluster ("mvn storm:jar"). The problem here is that since the local vs. remote logic is tied up in the main() method, I don't have control over it.

Okay, so for a second let's forget about separate run/jar options and assume that the "driver" class always decides how to deploy the topology.

If we let the driver class make the choice, I was thinking the plugin mojo could just call the driver class' main() method. I went down this path, but ran into two issues:

1. "StormSubmitter" gets the path to the jar file from the "STORM_JAR" environment variable. To get past that I either have to rely on the maven user setting the variable, or I have to use java.lang.ProcessBuilder to spawn a JVM to run the driver class' main() method (not very elegant, but should work).

2. "StormSubmitter" via "Utils.java" will pick up the nimbus host parameter from ~/.storm/storm.yaml, so to be able to make nimbus host configurable from the maven environment I would have to do something like: move ~/.storm/storm.yaml if it exists; write a temporary storm.yaml; run the deploy process; move the original back.

So as you can see (if I explained that well -- which I don't think I did ;) ) things start to fall apart and get pretty kludgy with the current model, and some of the features I have planned for the plugin start to drop. I'd really rather not have to muck with the user's environment if possible.

Ideally, I'd like to be able to use the StormSubmitter and NimbusClient classes directly, and use maven properties to build up the backtype.storm.Config object.
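
In other words, something along these lines inside the mojo (sketch only -- nimbusHost, nimbusPort, topologyName and topology stand in for plugin parameters):

// Build the Config from Maven-supplied values instead of a storm.yaml on disk.
Config conf = new Config();
conf.put(Config.NIMBUS_HOST, nimbusHost);          // e.g. injected via the plugin <configuration>
conf.put(Config.NIMBUS_THRIFT_PORT, nimbusPort);

// 'topology' is the StormTopology instance -- which is exactly what the current
// model gives me no way to get at.
StormSubmitter.submitTopology(topologyName, conf, topology);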

To do that I need some way to access the StormTopology instance for a class. Since, in the current model, the StormTopology instance is created in the main() method, I have no way of doing that.

So I guess what I'm requesting/proposing is something akin to Hadoop's Tool/ToolRunner. So a "driver" class might look something like this (TopologyDriver would be a new interface that defines a buildTopology() method):

public class MyTopologyDriver implements TopologyDriver {

    public StormTopology buildTopology(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // ... build the topology ...
        return builder.createTopology();
    }

    public static void main(String[] args) throws Exception {
        TopologyDriver driver = new MyTopologyDriver();
        StormTopology topology = driver.buildTopology(args);
        Config conf = new Config();

        StormSubmitter.submitTopology("my-topo", conf, topology);
    }
}

That's just an example, not a thought-out design; the idea is to expose the topology instance outside of a main() method so other tools can access it.

Hopefully that makes sense. :)

Thanks,

- Taylor

P. Taylor Goetz

Nov 17, 2011, 9:46:17 PM
to storm-user
The more I think about it, the more important I think this issue might be...

Based on the discussion surrounding articulating workflows on top of Storm and/or Hadoop (http://groups.google.com/group/storm-user/browse_thread/thread/b1d8732c5cbf0dfa?hl=en), this might very well come into play in other scenarios.

So for discussion purposes, let's just imagine we have some sort of language/vocabulary/API that enables you to articulate a workflow that could be applied to Storm (i.e. realtime) and/or Hadoop (i.e. batch).

A Storm-based implementation of this "workflow language" would likely need to dynamically generate and deploy (potentially multiple) topologies to a cluster. If that workflow was changed, the previous topologies that made up the workflow might need to be killed/swapped/redeployed.

I would think that use case would also require a more flexible deployment model -- i.e. deployment via a Storm API rather than CLI tools.

Just a thought.

- Taylor

Nathan Marz

Nov 19, 2011, 2:28:58 PM
to storm...@googlegroups.com
I thought about this for a while. Here's a question: what does it mean to choose between running a topology locally vs. remotely? When you submit a topology remotely, it runs forever. So what do you do in local mode? Do you just ctrl-c it when you're done watching it? What about cases like Distributed RPC, where testing it locally means setting up a LocalDRPC server and issuing DRPC requests to it, much more than just running a topology?

I have some potential solutions for #1 and #2. For #1, I can change how the storm jar is passed in by using a java property rather than an environment variable. So the jar would be passed in like this:

java -Dstorm.jar={path to jar} ...

Would this make it easier to integrate with Maven?
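
If so, your mojo could presumably set the property in-process right before calling StormSubmitter -- something like this (a sketch; the MavenProject bits are my guess at the Maven side):

// Point storm.jar at the artifact the build just produced, then submit in-process.
System.setProperty("storm.jar", project.getArtifact().getFile().getAbsolutePath());
StormSubmitter.submitTopology(topologyName, conf, topology);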

For #2, there are a few options. First of all, Storm doesn't explicitly look in ~/.storm for the storm.yaml file. The way Storm works is that it searches for storm.yaml on the classpath. The "storm" client adds ~/.storm to the classpath. So one solution to your problem would be to put the storm.yaml somewhere within the project and just add its directory to the classpath using Maven.

Another solution I'm considering is creating the possibility to set configurations using java properties rather than requiring that nimbus.host be set in a storm.yaml file. So something like this:

java -Dstorm:nimbus.host={location of nimbus} -Dstorm:nimbus.thrift.port={override the port here}
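
Concretely, that would boil down to something like this when the config is built (just a sketch of the idea, not how it's implemented -- the "storm:" prefix simply mirrors the example above):

// Fold any "storm:"-prefixed system properties into the config map, overriding
// whatever came from storm.yaml / defaults.yaml.
Map<String, Object> conf = new HashMap<String, Object>();
for (String key : System.getProperties().stringPropertyNames()) {
    if (key.startsWith("storm:")) {
        conf.put(key.substring("storm:".length()), System.getProperty(key));
    }
}
// conf would now contain e.g. "nimbus.host" and "nimbus.thrift.port" overrides.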

Let me know your opinion between these two approaches.

-Nathan

--
Twitter: @nathanmarz
http://nathanmarz.com

Thomas Dudziak

Nov 20, 2011, 1:03:38 AM
to storm...@googlegroups.com
For #2, please use system properties, which IMHO is the most flexible option. Any other mechanism could be implemented on top of this (e.g. the YAML file).

cheers,
Tom

P. Taylor Goetz

Nov 20, 2011, 11:16:07 AM
to storm...@googlegroups.com
Nathan,

Thanks for your input and consideration on this. I appreciate it.

TL;DR: Switching over to Java system properties would allow me to move forward with the Maven plugin.

In the long run, however, I think Storm could benefit from a "Deployment API" that allows tools to deploy Storm topologies programmatically, without having to rely on running an external process/command. Examples of such tools would be the "storm jar" command as well as the Maven plugin.

I can also think of other use cases. For example, in our environment, outside of the development group, we have QA and Client Solutions groups that will likely need the ability to manage Storm topologies (the same would hold true for Hadoop jobs). They would not necessarily develop the topologies, but would need the ability to deploy/kill them as well as adjust input parameters that alter the behavior of topologies. Having a tool to do this would make it easier on them. I could also see this type of functionality potentially being incorporated into Storm UI.
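
For example, a "kill" operation in such a tool might boil down to something like this (just a sketch -- I'm assuming NimbusClient can be pointed at a host and port directly, and the host, port, and topology name are placeholders):

// Kill a running topology from an ops/QA tool instead of the "storm kill" CLI.
NimbusClient nimbus = new NimbusClient("nimbus.example.com", 6627);
try {
    nimbus.getClient().killTopology("my-topo");
} finally {
    nimbus.close();
}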

Another use case would be dynamic topology generation and deployment (i.e. a Pig- or Cascading-like tool for Storm).

Such a tool would likely use the StormSubmitter.submitTopology() method, and would need a reference to a StormTopology instance -- which goes back to my argument for a way to access the StormTopology instance. For example:

String topoClassName = "com.foo.bar.MyTopology";
Class clazz = Class.forName(topoClassName);
TopologyDriver topoDriver = (TopologyDriver)clazz.newInstance();
StormTopology topo = topoDriver.buildTopology(args);

StormSubmitter.submitTopology(…);


Thanks for listening,

- Taylor




On Nov 19, 2011, at 2:28 PM, Nathan Marz wrote:

> I thought about this for a while. Here's a question: what does it mean to choose between running a topology locally vs. remotely? When you submit a topology remotely, it runs forever. So what do you do in local mode? Do you just ctrl-c it when you're done watching it? What about cases like Distributed RPC, where testing it locally means setting up a LocalDRPC server and issuing DRPC requests to it, much more than just running a topology?

In my opinion (and just my opinion), you run in local mode to test during development. You either sleep the thread for a bit, and then shut down:

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("storm-jms-example", conf, builder.createTopology());
Utils.sleep(120000);
cluster.killTopology("storm-jms-example");
cluster.shutdown();

Or you just let it run until the user/developer hits ctrl-c to shut it down.

I'm assuming this is what the LocalCluster class is for -- am I wrong?

As far as DRPC goes, to be honest, I haven't had a chance to look at it yet (though it's on our radar), so I can't offer any meaningful comment. Our main use of Storm doesn't rely on it, but I'll try to take a look at it soon.




> I have some potential solutions for #1 and #2. For #1, I can change how the storm jar is passed in by using a java property rather than an environment variable. So the jar would be passed in like this:
>
> java -Dstorm.jar={path to jar} ...
>
> Would this make it easier to integrate with Maven?

Yes. Initially...


> For #2, there are a few options. First of all, Storm doesn't explicitly look in ~/.storm for the storm.yaml file. The way Storm works is that it searches for storm.yaml on the classpath. The "storm" client adds ~/.storm to the classpath. So one solution to your problem would be to put the storm.yaml somewhere within the project and just add its directory to the classpath using Maven.
>
> Another solution I'm considering is creating the possibility to set configurations using java properties rather than requiring that nimbus.host be set in a storm.yaml file. So something like this:
>
> java -Dstorm:nimbus.host={location of nimbus} -Dstorm:nimbus.thrift.port={override the port here}
>
> Let me know your opinion between these two approaches.

Being able to pass in parameters with "-D" options would help get past the immediate issues. But in terms of a Maven plugin, we'd still be required to fork a JVM process in order to get it to work.

nathanmarz

Nov 21, 2011, 2:05:48 AM
to storm-user
To do custom stuff, you could always use the underlying Thrift interface that StormSubmitter uses. The StormSubmitter implementation is really straightforward. Although I will make StormSubmitter itself more flexible -- I'll make it so that some function is exposed such that you can specify everything you need completely programmatically, without reliance on external configs or environment variables.
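
The flow there is basically: upload the jar in chunks, then submit by referencing the upload location. Roughly like this (an untested sketch -- the client construction, chunk type, and the conf/topology variables are placeholders):

// Submit via the Nimbus Thrift API directly, the same calls StormSubmitter makes.
NimbusClient nimbus = new NimbusClient("nimbus.example.com", 6627);
Nimbus.Client client = nimbus.getClient();

// 1. Upload the topology jar in chunks.
String uploadLocation = client.beginFileUpload();
FileInputStream in = new FileInputStream("target/my-topology.jar");
try {
    byte[] buffer = new byte[1024 * 1024];
    int read;
    while ((read = in.read(buffer)) > 0) {
        client.uploadChunk(uploadLocation, ByteBuffer.wrap(buffer, 0, read));
    }
} finally {
    in.close();
}
client.finishFileUpload(uploadLocation);

// 2. Submit, referencing the uploaded jar; the config goes over the wire as JSON.
String jsonConf = JSONValue.toJSONString(conf);
client.submitTopology("my-topo", uploadLocation, jsonConf, topology);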

As for having a ToolRunner type interface, I don't see anything that's stopping you from doing that yourself. There's no reason why that really has to be a first-class concept for Storm. I prefer to provide the libraries for creating topologies and submitting them, and then you can manage that process yourself.

-Nathan
