--
You received this message because you are subscribed to the Google Groups "Flow Based Programming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-progra...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Router OUT -> IN Process OUT -> IN SaveToDatabase
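The graph above can be read as a three-stage pipeline. A minimal sketch as a chain of Python generators follows; the stage names mirror the graph, while the bodies are placeholders of my own, not any particular FBP runtime's API.

```python
# Minimal sketch of: Router OUT -> IN Process OUT -> IN SaveToDatabase
# Each stage consumes packets from upstream and yields them downstream.

def router(packets):
    # Forward each incoming packet downstream (a real router might
    # split the stream across several output ports).
    for packet in packets:
        yield packet

def process(packets):
    # Apply some transform to each packet; here we just tag it.
    for packet in packets:
        yield {**packet, "processed": True}

def save_to_database(packets):
    # Terminal stage: collect instead of writing to a real database.
    saved = []
    for packet in packets:
        saved.append(packet)
    return saved

source = [{"id": 1}, {"id": 2}]
result = save_to_database(process(router(source)))
```

Because generators are lazy, packets flow through the chain one at a time, which loosely mirrors the bounded-buffer connections of an FBP network.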
That came to my mind as well. So I'd say FBP itself does not handle parallelism. A strategy is put on top of it, in the same manner as GoF patterns are applied on top of OOP.
What classic FBP doesn't handle is implicit parallelism (my own avenue of development).
However, at present core counts, I don't believe this is a very big issue just yet. Classic FBP should generally be able to keep up to 8 cores fairly busy without problem. And the higher core counts today tend to be servers - they will have multiple users to keep the extra cores busy.
> What classic FBP doesn't handle is implicit parallelism (my own avenue of development).

Thanks Bert. That aligns with what I assume as well. Just to be sure: do you mean you prefer implicit parallelism over explicit parallelism?
> However, at present core counts, I don't believe this is a very big issue just yet. Classic FBP should generally be able to keep up to 8 cores fairly busy without problem. And the higher core counts today tend to be servers - they will have multiple users to keep the extra cores busy.

I'm more interested in distributed computing across a physical network, say hundreds of nodes. Should it follow the same general rules?
I haven't had experience with FBP at that scale so I'd appreciate any insight there.
There's a concept called the multiplexing factor, in which a stream is consumed by n copies of a process. The obvious problem is reassembling the output in order.
The inherent problem of automatic parallelism is that even if we have n cores available, any shared resource becomes the bottleneck, and in networked distributed systems it usually is the network throughput, or hard disk access.
I think the multiplexing concept needs to be developed further: it provides parallelism where the graph designer deems it necessary.
I agree that the defining property of FBP is the IP, which is a context by itself that moves inside the graph, making it well suited for parallelism and distribution.
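The multiplexing idea above can be sketched in a few lines, under the assumption that each packet is tagged with a sequence number so the merged output can be restored to input order. The `multiplex` function and its round-robin worker assignment are illustrative, not an existing implementation; a real runtime would schedule the n copies concurrently.

```python
# Sketch of the "multiplexing factor": n copies of a process consume
# one stream, and the output is reassembled in input order using the
# sequence numbers attached to each packet.

def multiplex(stream, transform, n):
    tagged = list(enumerate(stream))           # attach sequence numbers
    outputs = []
    for worker in range(n):                    # n "copies" of the process
        for seq, packet in tagged[worker::n]:  # round-robin assignment
            outputs.append((seq, transform(packet)))
    outputs.sort()                             # reassemble in input order
    return [packet for _, packet in outputs]

result = multiplex([3, 1, 2], lambda x: x * 10, n=2)
```

The sort on sequence numbers is the "reassembly" step; in a streaming setting it would become a reorder buffer that releases packets as soon as the next expected sequence number arrives.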
Because networking is so slow relative to in-memory processing, you want to minimize inter-node communication as much as possible. I can't emphasize this enough. Streaming a million records to another node for additional processing seems a cool thing to do, but really, it is a performance killer. This is true no matter what sort of processing you are doing.
It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.
One technique that helps is that, if you filter, you filter as early as possible - as close as possible to the actual data source.
Also, inter-node communication is really serial streaming. If you are doing parallel streaming, you will need to build it on top of that. Additionally, some of the signaling mechanisms used for simulating back pressure in a parallel flow environment don't work so well between nodes. In general, this doesn't have to be a problem, but you do need to consider it explicitly in your design. Oh, and one more thing: distributed computing pretty much requires passing only immutable data - which is also pretty much a requirement for parallel flow.
Classic FBP has sort of gotten by without requiring immutable data by imposing restrictions via programming convention, but personally I think it leaves an avenue open for possible abuse. Sadly, I have observed that some programmers tend to dislike following programming conventions to the point of ignoring them entirely.
Paul, What strategies have been considered for multiplexing? I have some thoughts about assigning unique ids to packets and also tracking their lifetime for failure handling, but I'd like to know what has been tried.
> Because networking is so slow relative to in-memory processing, you want to minimize inter-node communication as much as possible. I can't emphasize this enough. Streaming a million records to another node for additional processing seems a cool thing to do, but really, it is a performance killer. This is true no matter what sort of processing you are doing.

It does sound like explicit handling of IPs turns out to be a better approach than implicit parallelism.
> It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.

The tracking part is done by storing the context within the IPs. Would that be correct to say?
> One technique that helps is that, if you filter, you filter as early as possible - as close as possible to the actual data source.

The same argument for using stored procedures in the database layer, I suppose.
> Also, inter-node communication is really serial streaming. If you are doing parallel streaming, you will need to build it on top of that. Additionally, some of the signaling mechanisms used for simulating back pressure in a parallel flow environment don't work so well between nodes. In general, this doesn't have to be a problem, but you do need to consider it explicitly in your design. Oh, and one more thing: distributed computing pretty much requires passing only immutable data - which is also pretty much a requirement for parallel flow.

This aligns with my belief that all IPs should be immutable data.

> Classic FBP has sort of gotten by without requiring immutable data by imposing restrictions via programming convention, but personally I think it leaves an avenue open for possible abuse. Sadly, I have observed that some programmers tend to dislike following programming conventions to the point of ignoring them entirely.

That's only a natural thing to do though. If you have a pressing deadline and there's a way around some restrictions, the only sane decision is to take it.
I'm not sure why you think that follows. It shouldn't follow from my previous statement, which refers to distributed programming in general. As I said - no matter what sort of processing you are doing.
>> It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.
>
> The tracking part is done by storing the context within the IPs. Would that be correct to say?

No. IPs should generally be self-contained. Preferably immutable in most cases (required in my case). Also in my case, the framework handles the inter-node tracking.
Mmm, I'm going to suggest the sane thing to do is to first investigate why the restrictions are there.
> I'm not sure why you think that follows. It shouldn't follow from my previous statement, which refers to distributed programming in general. As I said - no matter what sort of processing you are doing.

I think I jumped the gun there. My logic was that if your scheduler/compiler/coordinator understands what is being processed (as in Dedalus' case, but not in FBP's case), it could compile and deploy parts of the code to the appropriate places to minimize traffic. I was thinking of FBP and implicit parallelism, which don't seem to make sense together for that reason. So yes, I was zooming into FBP specifically. My bad on that.
>>> It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.
>>
>> The tracking part is done by storing the context within the IPs. Would that be correct to say?
>
> No. IPs should generally be self-contained. Preferably immutable in most cases (required in my case). Also in my case, the framework handles the inter-node tracking.

I was under the impression that the tracking part would be done with some kind of token in the IP that is not accessible to the processes, not unlike how IP packets (the other IP) work, where applications at the TCP level have no access to the IP headers.
> Mmm, I'm going to suggest the sane thing to do is to first investigate why the restrictions are there.

Sorry for my sarcasm. It was too subtle in hindsight. I meant that if you could get around a restriction, would that really be a restriction? I'm in the camp of "if a restriction is violated, the program can't even run."
Hi Bert,
Can I clarify what I mean by immutable IPs?
Paul describes FBP with regard to IPs as follows:
" The FBP methodology requires the designer to lay out the IPs first, and then define the transforms which apply to them. "
http://www.jpaulmorrison.com/fbp/concepts_book.shtml
How can we apply transforms to immutable IPs? IPs represent real-world entities and the ways in which they change.
Modelling reality, where change is the only constant, is central to FBP. As summed up in the first chapter's epigraph, "Panta Rei":
"Heraclitus' philosophy can be captured in just two words: 'panta rei', literally everything flows, meaning that everything is constantly changing, from the smallest grain of sand to the stars in the sky."
http://www.optionality.net/heraclitus/
Change is the "flow" in Flow Based Programming, and the IP is where that change is modelled.
Make the IP immutable and you have removed the flow.
You no longer have flow based programming. You have functional programming, a completely different paradigm.
Of course, we can use immutable data structures to implement our IPs. We could implement IPs as an immutable identity that maps to an immutable state. In this case a transform will not alter the internal representation but instead point to a new one, in the same way that Java implements String.
However, the IP is not immutable. The state represented by the IP's identity is constantly changing.
The conceptual problem that I've found while implementing immutable IPs was that it becomes harder to track the lifetime of an entity in the system. The compromise I made was to add a debug mode that stacks the new state inside the packet and the name of the process that did the change.
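The debug-mode compromise described above might look like the following sketch, where each transform returns a new immutable packet and, when debugging, records the previous state together with the name of the transforming process. All names here (`IP`, `transform`, the process names) are illustrative, not taken from any real framework.

```python
from dataclasses import dataclass, replace
from typing import Any

# Sketch: an immutable IP whose transforms return new packets.
# In "debug mode" each transform records (process_name, previous_state)
# so an entity's lifetime can be reconstructed afterwards.

@dataclass(frozen=True)
class IP:
    identity: str
    state: Any
    history: tuple = ()   # ((process_name, old_state), ...)

def transform(ip, process_name, new_state, debug=False):
    history = ip.history + ((process_name, ip.state),) if debug else ip.history
    return replace(ip, state=new_state, history=history)

ip0 = IP("order-42", {"status": "new"})
ip1 = transform(ip0, "Validate", {"status": "valid"}, debug=True)
ip2 = transform(ip1, "Ship", {"status": "shipped"}, debug=True)
```

Since every transform produces a fresh packet, `ip0` is never altered; the history tuple is the "stacked" state trail mentioned above, at the cost of carrying extra data inside the packet while debugging.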
Hi Humberto,
The real world is not being added; it is a vital part of what an IP is. It is an analogue for a real world object.
It is this analogue relationship with the real world that makes it an _information_ packet rather than a _data_ packet.
http://www.differencebetween.info/difference-between-data-and-information
The relationship is mapped based on Identity.
We can describe an IP as a tuple: <Identity, State>
The Identity maps the state to some real world entity. The state encodes information that relates to that entity. For the system to be useful then that state must change. So our tuple must mutate. The state mapped to this specific identity must be able to change over time.
See also the Domain Driven Design concept of Entity:
http://culttt.com/2014/04/30/difference-entities-value-objects/
> Entities are mutable and addressable, values are not. Values can flow, entities can not.

Unfortunately, FBP breaks the last rule - it allows entity data to flow (although not necessarily the authoritative entity). Partly this is because data does not truly flow in FBP. What actually flows is a reference to a piece of data; the actual data itself remains in a fixed location in memory. I see this as a weakness in the FBP specification, which can be fixed by imposing immutability.
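One way to sketch the entity/value split argued above: the mutable entity is an identity-to-state mapping that stays in one place and never flows, while only snapshot values flow through the graph. This is a toy illustration of the idea, not anyone's actual design; a production version would use genuinely frozen structures for the snapshots as well.

```python
# The entity: a mutable, addressable mapping from identity to current
# state. It never flows; it is updated in place.
entities = {}

def update_entity(identity, state):
    entities[identity] = state   # replace the state for this identity

def snapshot(identity):
    # A value that can flow between processes: a fixed (identity, state)
    # pair. Replacing (not mutating) the stored state keeps old
    # snapshots intact.
    return (identity, entities[identity])

update_entity("sensor-7", {"temp": 20})
v1 = snapshot("sensor-7")
update_entity("sensor-7", {"temp": 25})
v2 = snapshot("sensor-7")
```

Here `v1` still shows the earlier reading even after the entity changed: the identity's state evolved over time, but each value that flowed is a stable fact about a moment in that history.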
Humberto Madeira scripsit:
> Streaming a million records to another node for additional processing
> seems a cool thing to do, but really, it is a performance killer.
Such absolute statements only make sense in specific contexts.
1) An example given in the WP article on MapReduce involves processing
1.1 billion records representing people. Each record contains a list
of social contacts and an age. What we want is the average number of
contacts for each age, so we have to reduce the 1.1 billion records to
just 96, one per age. Trying to do this without FBP (of which MapReduce
is a special case) is hopeless without distributed processing. You need
a little more than a thousand CPUs.
2) The New York Times needed to convert 4 TB of raw TIFF images to 11
million PDFs. They used 100 Amazon EC2 boxes and MapReduce to do the
conversion in 24 hours at a cost of only $240 (not counting storage
charges).
3) The Compact Muon Solenoid, part of the LHC at CERN, generates about 40 TB/s
of raw data, but most of these events are uninteresting. Consequently,
a two-stage FBP pipeline reduces the data first to 50 KB/s (this part is
done by custom hardware based on FPGAs) and then to 500 B/s. Even this
cannot be analyzed in real time, so it is saved to tape and processed
offline.
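The shape of the first computation above can be sketched in a few lines. This is a single-machine toy with made-up records; the point of MapReduce is that exactly this map/reduce shape distributes across the thousand-odd CPUs mentioned.

```python
from collections import defaultdict

# Single-machine sketch of the computation described above: map each
# person record to (age, contact_count), then reduce to the average
# number of contacts per age - one output record per distinct age.

def average_contacts_by_age(records):
    totals = defaultdict(lambda: [0, 0])   # age -> [contact_sum, people]
    for person in records:                  # "map" + shuffle by age key
        t = totals[person["age"]]
        t[0] += len(person["contacts"])
        t[1] += 1
    # "reduce": collapse each age's partial sums to a single average
    return {age: s / n for age, (s, n) in totals.items()}

records = [
    {"age": 30, "contacts": ["a", "b"]},
    {"age": 30, "contacts": ["c", "d", "e", "f"]},
    {"age": 40, "contacts": ["g"]},
]
result = average_contacts_by_age(records)
```

At scale, the map phase runs on many nodes in parallel and only the small (age, partial-sum) pairs cross the network, which is why the 1.1-billion-record reduction to 96 outputs is feasible.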
I don't think that automatic is necessarily the answer.
What is needed is the ability to abstract away the details of connectivity. Let's say we have four types of connector: ReadFile, Output, Transmit and Receive. ReadFile reads characters from a file. Output prints characters to a console. Transmit sends characters across a network connection. Receive receives characters across a network connection. Functionally, the following two graphs are identical:

[a] ReadFile -> Output
[b] ReadFile -> Transmit -> Receive -> Output

The only difference is that in graph [b] the file and the console might be on separate machines. Ideally I want to be able to take [a] and easily convert it into [b]. I'd also like to abstract away the network elements of [b] and view it as [a].
It would also be good to have a mechanism for tracking the lifetime of an IP across machines.
At the moment Transmit would have to DELETE its IPs and Receive would have to CREATE new ones.
It would be good to have the ability to SEND an IP over the network.
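The equivalence of graphs [a] and [b] can be sketched with composable stages, where Transmit/Receive is just a pair spliced into an edge. The connector names come from the post above; the bodies are hypothetical, and JSON serialization stands in for the actual network hop.

```python
import json

# Sketch of graphs [a] and [b]: Transmit/Receive is a pair of stages
# that can be spliced into any edge without changing the graph's
# function. json stands in for the wire; a real pair would move bytes
# between machines.

def read_chars(text):          # ReadFile stand-in: emit one char per IP
    yield from text

def transmit(packets):
    for p in packets:
        yield json.dumps(p)    # marshal each packet for the "network"

def receive(frames):
    for f in frames:
        yield json.loads(f)    # unmarshal on the far side

def output(packets):
    return "".join(packets)    # Output stand-in: collect the stream

graph_a = output(read_chars("flow"))
graph_b = output(receive(transmit(read_chars("flow"))))
```

Because the two graphs differ only by the spliced-in pair, converting [a] to [b] (or hiding the network stages when viewing [b]) is a purely structural edit, which is the abstraction being asked for.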
[...]
You don't need a BDUF - you do need to sketch out the design, prototype the bits you are unsure about, and then extend outwards, or from the back end backwards, adding function as needed. At any time, you can move processes around, re-cleave the overall network, etc. BDUF is very much an attempt to compensate for the rigidity of the old programming technology.
Regards,
Paul
It bothers me that industry has rejected engineering rather than striving to master the practices of good engineering.
Interesting examples, John! And, as you say, very amenable to FBP.
I understand the overhead of transmitting data across a network, but this is just one of the options available to an FBP application designer. Conceptually, you build a very high level network, and then, at some point, you decide to spread the network across the available hardware, adding "communication processes" as required.
In my book (Chap. XV, 2nd ed.), I call this process "cleaving". By the way, in FBP this is not an irreversible process, unlike conventional programming.
Quote from the chapter: "When you think about the ways a flow can be chopped up, remember you can use anything that communicates by means of data, which in data processing means practically any piece of hardware or software."
Of course, you try to minimize the amount of data being transmitted over the slower paths, but IMO this should be the designer's decision.
Bert, Ged, Alfredo, Ken, do all/any of you feel this should be done automatically? If so, I have to say this goes against my experience. That's not to say, however, that judicious use of measurement and tracking tools cannot provide useful guidance!
Regarding doing it automatically... I don't think that the distribution should be hidden within the flow graph, hidden away in the engine. It should be a component within the graph. There should be enough metadata to filter it out if desirable, but it should be fully accessible. Automatically generating the flow graph, though - that's a different thing entirely. I think that approach has a lot of promise.
As long as I can see the full graph, just as I do when viewing an SQL execution plan. How far is that from the approach you are taking?
Humberto Madeira scripsit:
> The fact is, passing 1.1 billion records back and forth is a performance
> killer.
That's like saying "Spending a billion dollars is a profit killer." It
depends on what that billion dollars buys you. If what you want is a
heavy-lift rocket to the Moon, it's cheap.
> Hadoop can be one way to perform the distributed query and aggregate the
> results, but Hadoop is not FBP.
Hadoop is indeed FBP, a particular instance of an FBP network.
"As another example, imagine that for a database of 1.1 billion people"
"Pretty quickly I thought about how we could do this (and have some fun along the way)"