One concern is that the cloud instantiation will take some time, so
the startup commands are blocking. Do the tasklets relinquish control
by themselves while blocking or do I need to fork another process to
do the setup on the command-line?
Multitasking in pypes is cooperative but not preemptive. It is the the tasklets responsibility to release control of the CPU using yield_ctrl().
The scheduler will not preempt. This decision is largely based on the fact that there's a specific chain of dependency (i.e., B depends on A therefore preempting A to run B provides little use). This also simplifies the scheduler code.
It also provides the flexibility to control the flow of execution inside the components rather than having the scheduler enforce its own semantics. You can call yield_ctrl() at any point and know that the next time you're called, execution will begin where it left off.
With that said, there are some caveats once you understand how scheduling works.
The scheduler itself is a tasklet. It takes a user's graph and performs a topological sort which provides a linear ordering based on dependencies. It loads the tasklets in the order they will run and then places itself as a blocking tasklet at the front.
When data is sent from the HTTP layer it's sent through a Stackless channel (asynchronously) which the scheduling tasklet is listening on. Once the channel sees data it causes the scheduling tasklet to run which then causes all other tasklets to run in their topological ordering.
As each tasklet (component) calls yield_ctrl() it allows the next to run. Once the last tasklet calls yield_ctrl(), the next tasklet to run is the scheduling tasklet which once again blocks waiting for input.
So ideally (or under normal circumstances), a component has some sense of completion on iteration. Consider a situation where data arrived on a components input port but the component only processed a portion of that data. The remaining portion would still be buffered on the input port which could then only be processed by sending more data in.
As a way around this (which you've already discovered), you could send "triggers" into the system which contain null or junk data. The simple act of sending anything on the scheduling tasklets channel is enough to execute the graph.
That means that rather than sending data into the system, you could write a component that fetches some data. You can then cause this component to run by sending triggers to the system.
In the case of something more programmatic, you should be able to "control" things just by talking over the scheduling tasklets channel. In other words, you could essentially write your own scheduler that simply controls when triggers/data get sent into pypes. Then your components could yield intermittently and your new controlling tasklet could determine when to block (maybe once every components input ports are empty).
In terms of code executed at instantiation time, this actually happens in the asynchronously in the HTTP layer. When a component is dragged from the left side list, within the Javascript layer we invoke a POST request to the HTTP server (the pylons "filters" controller).
This controller provides a RESTful interface for manipulating components (a.k.a., filters).
# instantiates a new filter instance (POST body contains a class name)
POST /filters
# updates an instance with new configuration information (PUT body contains new JSON config)
PUT /filters/id
# deletes an instance represented by id
DELETE /filters/id
# will return a list of all the filters for the given type (where type is Adapter, Publisher, etc.)
GET /filters?node=type
The actual async JS call to create filter instances is here:
So in terms of this code blocking (your construction/destruction semantics), it actually doesn't block and this code executes outside the context of tasklet (which is bound to the run() method).
Keep in mind that if it takes really long to complete and you try wiring this component before it does, the back-end may not have registered the component yet. In this case it would look like you're trying to wire a component that doesn't yet exist.
One option may be to leverage the fact that pypes provides these RESTful semantics. When a composite component is created, you could essentially invoke the REST actions to construct the graph remotely. Since this could all be done in Javascript, you could ensure nothing blocks. Then your controlling tasklet could communicate with the other node over HTTP the same way data is sent in to any instance of pypes (i.e., POST request to /docs controller). That call itself is also asynchronous. The subgraph or composite instance could then write to a local sink that your controlling tasklet occasionally checks for completed jobs.
Then just write a publisher on the composite instance that actually serializes the pypes Packet to disk where it can be pulled via a GET request (a serialized Packet would ideally be JSON but you could just pickle it)
This might help keep the communication channels simple using HTTP. Keep in mind that the pypes UI is nothing more than a bunch of Javascript code that invokes methods on the server for creating and manipulating graphs. All of that functionality is exposed by design and it's available over HTTP because we envisioned the sort of semantics you're attempting here and we wanted to have the flexibility to do some of these things one day.
So anything you can do in the UI is exposed over HTTP meaning that you should have the flexibility to completely control the backend using any language or code that supports the HTTP protocol. Pypes was designed for the web and should therefore be able to operate within a cloud environment but you're group is the first to actually attempt it.
As a last note, keep in mind that pypes supports content negotiation of compressed (gzip) data. If you data is relatively large, have your controlling tasklet compress it before sending it to the remote instance. Just set the proper header and the other server will be sure to decompress the message before sending it through the graph.
We use this technique along with batching to send lots of data over a relatively slow protocol. If we're feeding millions of documents, we'll typically batch then in groups of one thousand and then compress the payload before we submit it. This reduces the number of HTTP calls. We also talked about supporting Google Protocol Buffers in addition to HTTP.
Hope this helps.
-Eric