Interesting talk bringing together data flow, datalog, logical time and distributed systems


Samuel Lampa

unread,
Sep 28, 2015, 1:09:48 PM
to Flow Based Programming
Hi,

I think some of you in this group might find this talk by Peter Alvaro very interesting (from this year's StrangeLoop conference):


I think Peter does a great job of explaining, very clearly, how data flow and query languages (Datalog, for reasons he explains) are closely related, and he takes this forward to suggest some directions for getting better consistency guarantees in distributed systems, etc.

What I found especially interesting was the view of query languages as a declarative representation of data flow systems ... and the attempt to take this idea further and make (logical) time relationships declarative as well, in a distributed systems setting.
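
(For those who haven't come across Dedalus: it is essentially Datalog in which every fact carries a logical timestep, so state and message delay become explicit in the rules themselves. Very roughly, and from memory of the paper, so don't hold me to the exact syntax - the predicate names are made up:

log(Node, Entry) :- request(Node, Entry);
log(Node, Entry)@next :- log(Node, Entry);
replica(Other, Entry)@async :- log(Node, Entry), peer(Node, Other);

The first rule is ordinary "timeless" deduction, the second says the log persists from one timestep to the next, and the third says delivery to another node happens at some later, non-deterministic time - which is exactly the distributed-systems part.)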

Although I think the ideas are not new to many of you here, I thought it was a great presentation that inspired me to think more about the intersections of these concepts.

As always, I'm intrigued to hear feedback about this from you guys!

Best
// Samuel

Kenneth Kan

unread,
Sep 29, 2015, 10:38:22 PM
to Flow Based Programming
I really like the focus of the language, as opposed to a general-purpose language, if there even is such a thing. Specifically, time is a foundational concept of the system.

There are quite a few parallels with FBP that I see. It is almost language-agnostic if you squint. He uses Datalog as the base language, but any declarative/logic programming language should work as well, just with some additional constructs to match the system semantics. FBP of course is the same way, being almost language-agnostic. It's timely in that we are having this other discussion on whether FBP can be done in a conventional single-threaded JS environment.

What Paul and Ged portrayed to me as setting FBP apart is also related to time: embracing statefulness across time. I think the term is "time-stretched processes"?

It's interesting, though, that FBP seems to take the opposite approach to Dedalus', in that time is intentionally hidden. Rather, data flow is at the center. I wonder if Dedalus can be said to be more "powerful" than FBP in the sense that it encapsulates both concepts under one framework. For instance, while concurrency is handled at the FBP network level, FBP does not seem to offer a complete solution when it comes to parallelism (e.g. what happens when the network connection drops between process A and B?). Any thoughts?

Alfredo Sistema

unread,
Sep 29, 2015, 11:49:19 PM
to Flow Based Programming
A network sliced graph would require processes on each side to handle the connection logic, under whatever distribution model is being used. The rest of the program on each side should not need to care.

A OUT -> IN NetworkSender
NetworkReceiver OUT -> IN B

Nothing fancy.


Kenneth Kan

unread,
Sep 30, 2015, 12:34:46 AM
to Flow Based Programming
Right. I was referring not to the conceptual model but the physical connection itself. Take this trivial network:

Router OUT -> IN Process OUT -> IN SaveToDatabase

True that none of these would be aware of each other, but what tricks do I have if I want to make this network highly available given that each process is on a separate machine?


Ged Byrne

unread,
Sep 30, 2015, 1:56:25 AM
to Flow Based Programming
Hi Ken,

When it comes to issues like connection failures, FBP addresses them by tracking IPs.

Each IP is given an identity at creation, and a tally of every send and receive is kept until the IP is deleted. If the IP account does not balance at the end, the problem is signalled.
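
(Roughly speaking: a create or a receive puts an IP in a process's possession, and a send, drop or delete releases it, so any IP still unaccounted for when the network shuts down gets reported - at least that is how I understand the "IPs not disposed of" check in JavaFBP.)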

Checkpointing strategies are also easily implemented, as described in Paul's book: http://www.jpaulmorrison.com/fbp/checkpt.shtml

The significance of the IP appears to be the one thing that distinguishes FBP from the other data flow approaches.

Regards,


Ged

Kenneth Kan

unread,
Sep 30, 2015, 10:22:43 AM
to Flow Based Programming
That came to my mind as well. So I'd say FBP itself does not handle parallelism. A strategy is put on top of it in the same manner that GoF patterns are applied on top of OOP.

The significance of IPs in FBP is indeed a selling point for me too. And I was wondering whether having explicit control over the data has any advantage over Dedalus' implicit guarantee of eventual effect. The curiosity stems from the thought that if two approaches provide the same level of guarantees (in this case it's that you don't have lost data that you can't track) then you'd always pick the more expressive one. Thoughts?

Ged Byrne

unread,
Sep 30, 2015, 12:32:50 PM
to Flow Based Programming
Hi Ken,

Occam's razor says you should pick the simpler one.

For example, isn't SGML more expressive than HTML?

Regards,


Ged

Kenneth Kan

unread,
Sep 30, 2015, 1:28:55 PM
to Flow Based Programming
In my mind there are two separate concepts: expressiveness and generality. SGML is more general than HTML but not necessarily more expressive. By expressiveness I roughly mean "you can write less to achieve the same amount, if not more". It felt to me that FBP was more general than Dedalus as it wasn't only for data query systems, while one can arguably achieve more with Dedalus. For one, the graph designer (in this case just the developer) doesn't need to apply a checkpoint strategy; it's free in Dedalus.

I can see that with FBP you get reuse. I just haven't had enough experience in Dedalus-like systems to judge whether that's the case as well. Hence the question.

Humberto Madeira

unread,
Sep 30, 2015, 2:30:11 PM
to Flow Based Programming
Hi Ken,

On Wednesday, 30 September 2015 10:22:43 UTC-4, Kenneth Kan wrote:
That came to my mind as well. So I'd say FBP itself does not handle parallelism. A strategy is put on top of it in the same manner that GoF patterns are applied on top of OOP.

Based on what Paul himself has described, classic FBP can handle parallelism explicitly.  You can split a stream and run the IPs through multiple parallel components (and then recombine them afterward).
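
In the same loose notation used earlier in the thread, that looks something like this (the component names are just placeholders):

Reader OUT -> IN LoadBalance
LoadBalance OUT[0] -> IN Worker_0 OUT -> IN[0] Merge
LoadBalance OUT[1] -> IN Worker_1 OUT -> IN[1] Merge
Merge OUT -> IN Writer

The split and the recombine are explicit processes in the graph, which is what I mean by explicit parallelism.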

What classic FBP doesn't handle is implicit parallelism (my own avenue of development).  

However, at present core counts, I don't believe this is a very big issue just yet.  
Classic FBP should generally be able to keep up to 8 cores fairly busy without problem.
And the higher core counts today tend to be servers - they will have multiple users to keep the extra cores busy.

Regards,
--Bert

Kenneth Kan

unread,
Oct 1, 2015, 6:37:52 AM
to Flow Based Programming
What classic FBP doesn't handle is implicit parallelism (my own avenue of development).   

Thanks Bert. That aligns with what I assume as well. Just to be sure, do you mean you prefer implicit parallelism over explicit parallelism?
 

However, at present core counts, I don't believe this is a very big issue just yet.  
Classic FBP should generally be able to keep up to 8 cores fairly busy without problem.
And the higher core counts today tend to be servers - they will have multiple users to keep the extra cores busy.


I'm more interested in distributed computing across a physical network. Say hundreds of nodes. It should follow the same general rules? I haven't had experience with FBP at that scale so I'd appreciate any insight there.

Humberto Madeira

unread,
Oct 1, 2015, 10:41:59 AM
to Flow Based Programming
Hi Ken,


On Thursday, 1 October 2015 06:37:52 UTC-4, Kenneth Kan wrote:

What classic FBP doesn't handle is implicit parallelism (my own avenue of development).   

Thanks Bert. That aligns with what I assume as well. Just to be sure, do you mean you prefer implicit parallelism over explicit parallelism?

Yes. And I am far from being the only one.  Implicit parallelism, in general, if it can be made useful and performant, is currently considered to be one of the "holy grails" of computing.

In terms of flow programming, parallel flow (with implicit parallelism) is my main area of focus.
There are many areas of overlap with FBP, but also some areas that are quite different.
 
 

However, at present core counts, I don't believe this is a very big issue just yet.  
Classic FBP should generally be able to keep up to 8 cores fairly busy without problem.
And the higher core counts today tend to be servers - they will have multiple users to keep the extra cores busy.


I'm more interested in distributed computing across a physical network. Say hundreds of nodes.

Strangely enough, me too :-)
 
It should follow the same general rules?

Yes and no.  The key thing to consider is that it takes a lot more time to communicate between nodes than between components in an in-memory flow network.
 
I haven't had experience with FBP at that scale so I'd appreciate any insight there.

Because networking is so slow relative to in-memory processing, you want to minimize inter-node communication as much as possible.  
I can't emphasize this enough.

Streaming a million records to another node for additional processing seems a cool thing to do, but really, it is a performance killer.
This is true no matter what sort of processing you are doing.

It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.

One technique that helps is that, if you filter, you filter as early as possible - as close as possible to the actual data source.
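
In network terms that means putting the filter on the same node as the data, e.g. (made-up component names):

ReadRecords OUT -> IN FilterEarly OUT -> IN NetworkSender      (on the data host)
NetworkReceiver OUT -> IN Aggregate OUT -> IN Report           (on the central host)

rather than shipping the raw records across and filtering on the far side.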

Also, inter-node communication is really serial streaming. If you are doing parallel streaming, you will need to build it on top of that.

Additionally, some of the signaling mechanisms used for simulating back pressure in a parallel flow environment don't work so well between nodes.
In general, this doesn't have to be a problem, but you do need to consider it explicitly in your design. 

Oh and one more thing.  Distributed computing pretty much requires passing only immutable data - which is also pretty much a requirement for parallel flow.  

Classic FBP has sort of gotten by without requiring immutable data by imposing restrictions via programming convention, but personally I think it leaves an avenue open for possible abuse.  Sadly, I have observed that some programmers tend to dislike following programming conventions to the point of ignoring them entirely.

Regards,
--Bert

John Cowan

unread,
Oct 3, 2015, 10:23:33 AM
to flow-based-...@googlegroups.com
Kenneth Kan scripsit:

> That came to my mind as well. So I'd say FBP itself does not handle
> parallelism. A strategy is put on top of it in the same manner that GoF
> patterns are applied on top of OOP.

Underneath it, rather. If the framework provides parallelism, FBP can
exploit it up to and including the number of components, since each
component can execute on a separate hyperthread/core/CPU.
--
John Cowan http://www.ccil.org/~cowan co...@ccil.org
Thor Heyerdahl recounts his attempt to prove Rudyard Kipling's theory
that the mongoose first came to India on a raft from Polynesia.
--blurb for Rikki-Kon-Tiki-Tavi

Alfredo Sistema

unread,
Oct 3, 2015, 2:51:30 PM
to flow-based-...@googlegroups.com

There's a concept called the multiplexing factor, in which a stream is consumed by n copies of a process. The problem is obviously reassembling the output in order.
The inherent problem of automatic parallelism is that even if we have n cores available, any shared resource becomes the bottleneck, and in networked distributed systems that is usually network throughput or hard disk access.
I think the multiplexing concept needs to be developed further; it provides parallelism where the graph designer deems it necessary.
I agree that the defining property of FBP is the IP, which is a context in itself that moves inside the graph, making it well suited for parallelism and distribution.



Paul Morrison

unread,
Oct 3, 2015, 5:48:58 PM
to flow-based-...@googlegroups.com
Alfredo, as you say, it's a pain to recombine data that has been split using the LoadBalance function, so we don't do that unless we have to.

LoadBalance has been updated (in JavaFBP and C#FBP so far) to make it substream-sensitive - so correspondingly, if you do want to merge the streams back, you have to use a substream-sensitive merge.  In the case of this diagram, this would funnel everything back into one WebsocketsRespond process - which is not necessary, so I use multiple instances of that process as well! 
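
Schematically, and simplified from the diagram (port details omitted), the shape is roughly:

... OUT -> IN LoadBalance
LoadBalance OUT[0] -> IN Process_0 OUT -> IN WebsocketsRespond_0
LoadBalance OUT[1] -> IN Process_1 OUT -> IN WebsocketsRespond_1

i.e. no merge at all - each branch keeps its own responder.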

Regards,

Paul

Kenneth Kan

unread,
Oct 3, 2015, 6:39:56 PM
to Flow Based Programming

Because networking is so slow relative to in-memory processing, you want to minimize inter-node communication as much as possible.  
I can't emphasize this enough.

Streaming a million records to another node for additional processing seems a cool thing to do, but really, it is a performance killer.
This is true no matter what sort of processing you are doing.

It does sound like explicit handling of IPs turns out to be a better approach than implicit parallelism.
 

It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.

The tracking part is done by storing the context within the IPs. Would that be correct to say?
 

One technique that helps is that, if you filter, you filter as early as possible - as close as possible to the actual data source.

The same argument for using stored procedures in the database layer, I suppose.
 

Also, inter-node communication is really serial streaming. If you are doing parallel streaming, you will need to build it on top of that.

Additionally, some of the signaling mechanisms used for simulating back pressure in a parallel flow environment don't work so well between nodes.
In general, this doesn't have to be a problem, but you do need to consider it explicitly in your design. 

Oh and one more thing.  Distributed computing pretty much requires passing only immutable data - which is also pretty much a requirement for parallel flow.  

This aligns with my belief that all IPs should be immutable data.
 

Classic FBP has sort of gotten by without requiring immutable data by imposing restrictions via programming convention, but personally I think it leaves an avenue open for possible abuse.  Sadly, I have observed that some programmers tend to dislike following programming conventions to the point of ignoring them entirely.


That's only a natural thing to do though. If you have a pressing deadline and there's a way around some restrictions, the only sane decision is to take it.

Alfredo Sistema

unread,
Oct 3, 2015, 8:09:00 PM
to Flow Based Programming

Paul, what strategies have been considered for multiplexing? I have some thoughts about assigning unique IDs to packets and also tracking their lifetime for failure handling, but I'd like to know what has been tried.



Humberto Madeira

unread,
Oct 3, 2015, 9:07:05 PM
to Flow Based Programming
Hi Ken,


On Saturday, 3 October 2015 18:39:56 UTC-4, Kenneth Kan wrote:

Because networking is so slow relative to in-memory processing, you want to minimize inter-node communication as much as possible.  
I can't emphasize this enough.

Streaming a million records to another node for additional processing seems a cool thing to do, but really, it is a performance killer.
This is true no matter what sort of processing you are doing.

It does sound like explicit handling of IPs turns out to be a better approach than implicit parallelism.

I'm not sure why you think that follows.  It shouldn't follow from my previous statement which refers to distributed programming in general.
As I said - no matter what sort of processing you are doing. 
 

It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.

The tracking part is done by storing the context within the IPs. Would that be correct to say?

No.  IP's should generally be self-contained. Preferably immutable in most cases (required in my case). 
Also in my case, the framework handles the inter-node tracking.
 
 

One technique that helps is that, if you filter, you filter as early as possible - as close as possible to the actual data source.

The same argument for using stored procedures in the database layer, I suppose.

Exactly.
 
 

Also, inter-node communication is really serial streaming. If you are doing parallel streaming, you will need to build it on top of that.

Additionally, some of the signaling mechanisms used for simulating back pressure in a parallel flow environment don't work so well between nodes.
In general, this doesn't have to be a problem, but you do need to consider it explicitly in your design. 

Oh and one more thing.  Distributed computing pretty much requires passing only immutable data - which is also pretty much a requirement for parallel flow.  

This aligns with my belief that all IPs should be immutable data.
 

Classic FBP has sort of gotten by without requiring immutable data by imposing restrictions via programming convention, but personally I think it leaves an avenue open for possible abuse.  Sadly, I have observed that some programmers tend to dislike following programming conventions to the point of ignoring them entirely.


That's only a natural thing to do though. If you have a pressing deadline and there's a way around some restrictions, the only sane decision is to take it.
 
Mmm, I'm going to suggest the sane thing to do is to first investigate why the restrictions are there.

Regards,
--Bert

Kenneth Kan

unread,
Oct 3, 2015, 9:40:45 PM
to Flow Based Programming
I'm not sure why you think that follows.  It shouldn't follow from my previous statement which refers to distributed programming in general.
As I said - no matter what sort of processing you are doing. 

I think I jumped the gun there. My logic was that if your scheduler/compiler/coordinator understands what is being processed (as in Dedalus' case but not in FBP's case), the scheduler/compiler/coordinator could compile and deploy parts of the code to the appropriate places to minimize traffic. I was thinking FBP and implicit parallelism, which don't seem to make sense together for that reason. So yes, I was zooming into FBP specifically. My bad on that.

 

It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.

The tracking part is done by storing the context within the IPs. Would that be correct to say?

No.  IP's should generally be self-contained. Preferably immutable in most cases (required in my case). 
Also in my case, the framework handles the inter-node tracking.

I was under the impression that the tracking part would be done with some kind of token in the IP that is not accessible to the processes, not unlike how IP packets (the other one) work where applications at the TCP level would have no access to the IP headers.

 
Mmm, I'm going to suggest the sane thing to do is to first investigate why the restrictions are there.


Sorry for my sarcasm. It was too subtle in hindsight. I meant that if you could get around a restriction, would that really be a restriction? I'm in the camp of "if a restriction is violated, the program can't even run."

Ged Byrne

unread,
Oct 4, 2015, 12:41:02 PM
to flow-based-...@googlegroups.com
Hi Bert,

Can I clarify what I mean by immutable IPs?

Paul describes FBP with regard to IPs as follows:

" The FBP methodology requires the designer to lay out the IPs first, and then define the transforms which apply to them. "
http://www.jpaulmorrison.com/fbp/concepts_book.shtml

How can we apply transforms to immutable IPs? IPs represent real-world entities and the way in which they change. Modelling reality, where change is the only constant, is central to FBP. As summed up in the first chapter's epigraph of "Panta Rei":

"Heraclitus' philosophy can be captured in just two words: 'panta rei', literally everything flows, meaning that everything is constantly changing, from the smallest grain of sand to the stars in the sky."
http://www.optionality.net/heraclitus/

Change is the "flow" in Flow Based Programming, and the IP is where that change is modelled.

Make the IP immutable and you have removed the flow. You no longer have flow based programming. You have functional programming, a completely different paradigm.

Of course, we can use immutable data structures to implement our IPs. We could implement IPs as an immutable identity that maps to an immutable state. In this case a transform will not alter the internal representation but instead point to a new one, in the same way that Java implements String.

However, the IP is not immutable. The state represented by the IP's identity is constantly changing.

Regards,


Ged



Humberto Madeira

unread,
Oct 4, 2015, 12:49:50 PM
to Flow Based Programming
Hi Ken,

On Saturday, 3 October 2015 21:40:45 UTC-4, Kenneth Kan wrote:
I'm not sure why you think that follows.  It shouldn't follow from my previous statement which refers to distributed programming in general.
As I said - no matter what sort of processing you are doing. 

I think I jumped the gun there. My logic was that if your scheduler/compiler/coordinator understands what is being processed (as in Dedalus' case but not in FBP's case), the scheduler/compiler/coordinator could compile and deploy parts of the code to the appropriate places to minimize traffic. I was thinking FBP and implicit parallelism, which don't seem to make sense together for that reason. So yes, I was zooming into FBP specifically. My bad on that.

FBP has no particular feature in it to handle distributed programming.  My own parallel flow framework does, but it uses a different paradigm for inter-node messaging.

Based on my limited skim of what Dedalus is doing, my own framework likely has some similarities.
 

 

It is not just the cost of networking, it is also the cost of marshalling and un-marshalling the data, and tracking the communications.

The tracking part is done by storing the context within the IPs. Would that be correct to say?

No.  IP's should generally be self-contained. Preferably immutable in most cases (required in my case). 
Also in my case, the framework handles the inter-node tracking.

I was under the impression that the tracking part would be done with some kind of token in the IP that is not accessible to the processes, not unlike how IP packets (the other one) work where applications at the TCP level would have no access to the IP headers.

In my case, message tracking is done inside a special messaging framework called from a component in the flow network.  The IPs are not individually tracked.
 

 
Mmm, I'm going to suggest the sane thing to do is to first investigate why the restrictions are there.


Sorry for my sarcasm. It was too subtle in hindsight. I meant that if you could get around a restriction, would that really be a restriction? I'm in the camp of "if a restriction is violated, the program can't even run."


Yes, they are still really restrictions.  You may be in that camp, but that camp is an artificially manufactured illusion that depends on environments specially crafted for that purpose.

It sure would be nice to be able to rely on such specially crafted environments (in fact I am personally in the process of building something of the sort)

However, if you insist on using FBP (or any other asynchronous environment not specifically crafted for your camp), and don't want to follow the conventions, 
I would recommend you prepare for the sudden loss of a few weeks of your time sometime after your project has been released.

Regards,
--Bert

Humberto Madeira

unread,
Oct 4, 2015, 2:08:17 PM
to Flow Based Programming
Hi Ged,


On Sunday, 4 October 2015 12:41:02 UTC-4, Ged Byrne wrote:
Hi Bert,

Can I clarify what I mean by immutable IPs?

Paul describes FBP with regard to IPs as follows:

" The FBP methodology requires the designer to lay out the IPs first, and then define the transforms which apply to them. "
http://www.jpaulmorrison.com/fbp/concepts_book.shtml

How can we apply transforms to immutable IPs? IPs represent real-world entities and the way in which they change.

I'm not sure why you had to add "real world" here.  IP's are just objects.  Objects can model anything you want.
 
Modelling reality, where change is the only constant is central to FBP. As summed up in the first chapter's epigraph of "Panta Reih":

"Heraclitus' philosophy can be captured in just two words: 'panta rei', literally everything flows, meaning that everything is constantly changing, from the smallest grain of sand to the stars in the sky."
http://www.optionality.net/heraclitus/


Consider a Star Trek-style transporter.  Are the molecules of a transportee the same or different, and does it matter as long as the copy is perfect?

What if you keep both copies? (interestingly written Star Trek episodes aside)
 
Change is the "flow" in Flow Based Programming, and the IP is where that change is modelled.

So why must the IP itself remain the same going through a component?

What does it matter if I replace the IP with an identical copy, or with a slightly modified one, as opposed to changing the original?
 

Make the IP immutable and you have removed the flow.

You might want to think that's a corollary of flow, but it's not.  Nothing in FBP restricts me from passing an immutable IP.
 
You no longer have flow based programming. You have functional programming, a completely different paradigm.


Functional programming requires more than just immutable objects.  It requires there be no side effects.  FBP has side effects.
 
Of course, we can use immutable data structures to implement our IPs. We could implement IPs as an immutable identity that maps to an immutable state. In this case a transform will not alter the internal representation but instead point to a new one, in the same way that Java implements String.


OK.  So you know how it is done. 
 
However, the IP is not immutable. The state represented by the IP's identity is constantly changing.

Um, what?  If I pass in an immutable object as an IP, then the IP is immutable. 

And what do you mean that the state is constantly changing?

AFAIK, the state represented by the IP should only be changed by a component in the flow network, and at that by only one of them at a time.
(mind you, this is only a programming convention, and not enforceable like an immutable IP would be)

Regards,
--Bert

Alfredo Sistema

unread,
Oct 4, 2015, 2:26:02 PM
to Flow Based Programming

The conceptual problem I found while implementing immutable IPs was that it becomes harder to track the lifetime of an entity in the system. The compromise I made was to add a debug mode that stacks the new state inside the packet, along with the name of the process that made the change.


Humberto Madeira

unread,
Oct 4, 2015, 3:37:20 PM
to Flow Based Programming
Hi Alfredo

On Sunday, 4 October 2015 14:26:02 UTC-4, Alfredo wrote:

The conceptual problem I found while implementing immutable IPs was that it becomes harder to track the lifetime of an entity in the system. The compromise I made was to add a debug mode that stacks the new state inside the packet, along with the name of the process that made the change.



I have been looking at something similar. (although I don't need it very often)

I believe it is a good compromise.

Regards,
--Bert 

Ged Byrne

unread,
Oct 4, 2015, 3:58:42 PM
to flow-based-...@googlegroups.com
Hi Humberto,

The real world is not being added; it is a vital part of what an IP is. It is an analogue for a real-world object.

It is this analogue relationship with the real world that makes it an _information_ packet rather than a _data_ packet.

http://www.differencebetween.info/difference-between-data-and-information

The relationship is mapped based on Identity.

We can describe an IP as a tuple: <Identity, State>

The Identity maps the state to some real-world entity. The state encodes information that relates to that entity. For the system to be useful, that state must change. So our tuple must mutate. The state mapped to this specific identity must be able to change over time.
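
To take a made-up example: the IP <Order#1234, "received"> goes into a Validate process and comes out as <Order#1234, "validated"> - the identity is constant, the state is not.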

See also the Domain Driven Design concept of Entity:
http://culttt.com/2014/04/30/difference-entities-value-objects/


Regards,


Ged

Humberto Madeira

unread,
Oct 4, 2015, 5:57:58 PM
to Flow Based Programming


On Sunday, 4 October 2015 15:58:42 UTC-4, Ged Byrne wrote:
Hi Humberto,

The real world is not being added; it is a vital part of what an IP is. It is an analogue for a real-world object.

It is this analogue relationship with the real world that makes it an _information_ packet rather than a _data_ packet.

http://www.differencebetween.info/difference-between-data-and-information


From your source: "Data usually refers to raw data, or unprocessed data."

I find that an interesting statement.  It means that sometimes data does not refer to raw data or unprocessed data.

Which means that the statement makes no logical assertion.  Just a fuzzy one.  And at that, with no evidence to back it up.

Also, your reference makes no assertion about the "real world" vs a model.

Here is one that does:

And here is another:

Granted that neither quote contains its own supporting data, but at least the authors are identified, and it is possible to back check the quality of their work.

Webster's defines a datum as "single piece of information; a fact; especially a piece of information obtained by observation or experiment - used mostly in the plural."

I note that "especially" is another weaselly non-assertion, so the first part is what really matters: a datum is a "single piece of information; a fact", and data is the plural.

It is really that simple.  So data means information, or facts.  And I would suggest that even the fact part is an implication that has been open to abuse.  Data may be untrue.

Here is another interesting take on the notion of facts even when they are true:

So information may not be fact, and fact may not be information.  And they are both considered to be data.
 
The relationship is mapped based on Identity.

We can describe an IP as a tuple: <Identity, State>
 
In fact, I do.  (sorry, couldn't help it)
 

The Identity maps the state to some real-world entity. The state encodes information that relates to that entity. For the system to be useful, that state must change. So our tuple must mutate. The state mapped to this specific identity must be able to change over time.

See also the Domain Driven Design concept of Entity:
http://culttt.com/2014/04/30/difference-entities-value-objects/

In fact, I personally follow this model of programming quite closely.  I distinguish strongly between entities and values.  
Entities are mutable and addressable, values are not.  Values can flow, entities can not.

Unfortunately FBP breaks the last rule - it allows entity data to flow (although not necessarily the authoritative entity).
  
Partly this is because data does not truly flow in FBP.  What actually flows is a reference to a piece of data, the actual data itself remains in a fixed location in memory.

I see this as a weakness in the FBP specification, which can be fixed by imposing immutability.

Regards,
--Bert

Paul Morrison

unread,
Oct 4, 2015, 6:12:34 PM
to flow-based-...@googlegroups.com
Ged, I have to agree with you!  I really relate to your remark that IPs are where change occurs!  You will have noticed that my little diagram - http://www.jpaulmorrison.com/graphicsstuff/ClientServerMultiplex.png - definitely does not require immutable IPs.  In fact, the processes labelled Process 0, 1 and 2 are where the changes occur - the rest basically do reformatting, signalling, etc.

Of course you can simulate change with immutable IPs by creating a copy of the incoming IP, doing the processing logic using the old and new IPs, outputting the new IP and discarding the old IP...  but that seems like a lot of overhead to do the same thing!

Regards,

Paul

Paul Morrison

unread,
Oct 4, 2015, 6:19:15 PM
to flow-based-...@googlegroups.com
Hi Alfredo, that's an interesting idea!  Hadn't thought of that!  However, in my experience it is usually pretty obvious which process made a given change.  And you can always insert display processes to see the IPs on a particular connection.

Paul M.

Paul Morrison

unread,
Oct 4, 2015, 6:30:15 PM
to flow-based-...@googlegroups.com


On Sun, Oct 4, 2015 at 5:57 PM, Humberto Madeira <kuna...@gmail.com> wrote:



Entities are mutable and addressable, values are not.  Values can flow, entities can not.

Unfortunately FBP breaks the last rule - it allows entity data to flow (although not necessarily the authoritative entity).
  
Partly this is because data does not truly flow in FBP.  What actually flows is a reference to a piece of data, the actual data itself remains in a fixed location in memory.

I see this as a weakness in the FBP specification, which can be fixed by imposing immutability.



Bert, I think I have to push back on this!  You've stated a rule that I disagree with, and then said that FBP breaks it!   Although that's also why I started using the term IP, rather than entity - your statement may be true for the usual definition of entity, but FBP doesn't use that term any more.

IMO FBP deals with streams of data chunks or IPs (or their references if you want), in the same way that a bottling plant works with streams of bottles. For me, values are attributes of the entities represented by the IPs - they cannot flow independently of the IPs they are part of.  Why is this a weakness in the FBP specification?

Regards,

Paul M.

John Cowan

unread,
Oct 5, 2015, 10:18:00 AM
to flow-based-...@googlegroups.com
Humberto Madeira scripsit:

> Streaming a million records to another node for additional processing
> seems a cool thing to do, but really, it is a performance killer.

Such absolute statements only make sense in specific contexts.

1) An example given in the WP article on MapReduce involves processing
1.1 billion records representing people. Each record contains a list
of social contacts and an age. What we want is the average number of
contacts for each age, so we have to reduce the 1.1 billion records to
just 96, one per age. Trying to do this without FBP (of which MapReduce
is a special case) is hopeless without distributed processing. You need
a little more than a thousand CPUs.

2) The New York Times needed to convert 4 TB of raw TIFF images to 11
million PDFs. They used 100 Amazon EC2 boxes and MapReduce to do the
conversion in 24 hours at a cost of only $240 (not counting storage
charges).

3) The Compact Muon Solenoid, part of LHC at CERN, generates about 40 TB/s
of raw data, but most of these events are uninteresting.  Consequently,
a two-stage FBP pipeline reduces the data first to 50 KB/s (this part is
done by custom hardware based on FPGAs) and then to 500 B/s.  Even this
cannot be analyzed in real time, so it is saved to tape and processed
offline.

> This is true no matter what sort of processing you are doing.

Not.
Don't be so humble. You're not that great.
--Golda Meir

John Cowan

unread,
Oct 5, 2015, 10:20:47 AM
to flow-based-...@googlegroups.com
Humberto Madeira scripsit:

> FBP has no particular feature in it to handle distributed programming.

It has none because it needs none. An FBP framework that executes
different components on (locationally) different computers is very
sensible, particularly if the inputs are available on one machine and
the outputs are needed at another, and there is enough data that the
processing cost overwhelms the bandwidth requirement. See my previous post.
Half the lies they tell about me are true.
--Tallulah Bankhead, American actress

John Cowan

unread,
Oct 5, 2015, 10:38:17 AM
to flow-based-...@googlegroups.com
Ged Byrne scripsit:

> The real world is not being added, it is a vital part of what an IP is. It
> is an analogue for a real world object.

All an IP is, is a unit of data to be processed. The framework does not
care whether a component drops its IPs (and possibly creates new ones)
or whether it passes them on unmodified. If the first strategy were
prohibited, it would be difficult to create components in an FP language.
If the second strategy were forbidden, it would place an unnecessary
limitation on implementations in imperative languages.

Since the framework never examines the payload, this flexibility is
perfectly safe and component-specific.
May the hair on your toes never fall out! --Thorin Oakenshield (to Bilbo)

Ged Byrne

unread,
Oct 5, 2015, 11:08:53 AM10/5/15
to flow-based-...@googlegroups.com
Hi John,

An IP is a little more because it can have an identity that persists beyond a single process.

The framework does track an IP's identity from creation to deletion. If every process drops and creates instead of sending then this does make a difference, conceptually.

Exploiting this feature is entirely optional, but I find it very valuable.

Regards,



Ged

Humberto Madeira

unread,
Oct 12, 2015, 4:19:54 PM
to Flow Based Programming, co...@mercury.ccil.org
Hi John,


On Monday, 5 October 2015 10:18:00 UTC-4, John Cowan wrote:
Humberto Madeira scripsit:

> Streaming a million records to another node for additional processing
> seems a cool thing to do, but really, it is a performance killer.

Such absolute statements only make sense in specific contexts.

1) An example given in the WP article on MapReduce involves processing
1.1 billion records representing people.  Each record contains a list
of social contacts and an age.  What we want is the average number of
contacts for each age, so we have to reduce the 1.1 billion records to
just 96, one per age.  Trying to do this without FBP (of which MapReduce
is a special case) is hopeless without distributed processing.  You need
a little more than a thousand CPUs. 

The fact is, passing 1.1 billion records back and forth is a performance killer.

Just reading 1 billion records into memory so you can process them will kill enough performance (you don't need the extra 100 million to prove anything).

Generally, you will want to spread out the cost of reading (never mind storage) over multiple spindles on multiple hosts.

Preferably, if you anticipated the query in advance, you will have it precalculated as aggregate records of some form or other.
(you could probably get by with far fewer CPUs than a thousand)

Then you just have to apply a query to each host and collect and process the results.

Hadoop can be one way to perform the distributed query and aggregate the results, but Hadoop is not FBP. 

Hadoop can be used with FBP. Or it can be used without FBP (more often than not - without).  
The volume of output from individual hosts should be about the same.

Regardless of how you do it, whatever filtering you do needs to be done on the individual hosts, near the source data, as early as possible.
This is done in order to minimize the number of records being transferred, which just goes to prove my point.


2) The New York Times needed to convert 4 TB of raw TIFF images to 11
million PDFs.  They used 100 Amazon EC2 boxes and MapReduce to do the
conversion in 24 hours at a cost of only $240 (not counting storage
charges).

At T3 speeds, it takes almost the full 24 hrs just to transmit one terabyte, never mind all 4.
And then the images had to be transmitted back.

Even assuming the converted return images included a lot of compression to the point they can be ignored, 
that's still almost 4x as much time to transfer the data as it took to do the actual calculations.

And for those of us with monthly caps - it might be better to overnight Fedex a 4 TB hard drive 
(actually, it would still be better even in your sample case)

Oh, and MapReduce still has nothing to do with FBP.

My point still stands.

3) The Compact Muon Solenoid, part of LHC at CERN, generates about 40 TB/s
of raw data, but most of these events are uninteresting.  Consequently,
a two-stage FBP pipeline reduces the data first to 50 KB/s (this part is
done by custom hardware based on FPGAs) and then to 500 B/s.  Even this
cannot be analyzed in real time, so it is saved to tape and processed
offline.

So where is the part where they send all this data out over the network so it can be processed?

I guess it was too expensive to do in real time.

In any case, the whole point of the data reduction pipeline in the custom hardware is to reduce the amount of data to be transmitted,
and eventually stored, as early as possible in the acquisition stream.

Which is exactly my point.

Regards,
--Bert 

Raoul Duke

unread,
Oct 12, 2015, 4:25:20 PM
to flow-based-...@googlegroups.com
> The inherent problem of automatic parallelism is that even if we have n
> cores available, any shared resource becomes the bottleneck, and in
> networked distributed systems it usually is the network throughput, or hard
> disk access.

Yeah! Is anybody on this list well connected with angel funding? Ever
since working on OpenStack Swift (and such ilk) I have this
hare-brained system design idea aimed at working around such
limitations in a truly massively parallel manner. :-)

John Cowan

unread,
Oct 12, 2015, 7:02:39 PM
to Humberto Madeira, Flow Based Programming
Humberto Madeira scripsit:

> The fact is, passing 1.1 billion records back and forth is a performance
> killer.

That's like saying "Spending a billion dollars is a profit killer." It
depends on what that billion dollars buys you. If what you want is a
heavy-lift rocket to the Moon, it's cheap.

> Hadoop can be one way to perform the distributed query and aggregate the
> results, but Hadoop is not FBP.

Hadoop is indeed FBP, a particular instance of an FBP network.
Híggledy-pěggledy / XML programmers
Try to escape those / I-eighteen-N woes;
Incontrovertibly / What we need more of is
Unicode weenies and / François Yergeaus.

Paul Morrison

unread,
Oct 13, 2015, 1:50:04 PM
to Flow Based Programming, co...@mercury.ccil.org
Interesting examples, John!   And, as you say, very amenable to FBP.

I understand the overhead of transmitting data across a network, but this is just one of the options available to an FBP application designer.  Conceptually, you build a very high level network, and then, at some point, you decide to spread the network across the available hardware, adding "communication processes" as required. In my book (Chap. XV, 2nd ed.), I call this process "cleaving".   By the way, in FBP this is not an irreversible process, unlike conventional programming. 

Quote from the chapter: "When you think about the ways a flow can be chopped up, remember you can use anything that communicates by means of data, which in data processing means practically any piece of hardware or software."

Of course, you try to minimize the amount of data being transmitted over the slower paths, but IMO this should be the designer's decision.

Bert, Ged, Alfredo, Ken, do all/any of you feel this should be done automatically?   If so, I have to say this goes against my experience.  That's not to say, however, that judicious use of measurement and tracking tools cannot provide useful guidance!

Regards,

Paul

Raoul Duke

unread,
Oct 13, 2015, 2:51:14 PM10/13/15
to flow-based-...@googlegroups.com, co...@mercury.ccil.org
"transmitting data across a network" should indeed be seen
metaphorically. Consider how Erlang lets you go from running
everything on a single machine, to a distributed model, very quickly
since you were already supposed to use a style of communicating
components. So sometimes the 'network' is the memory system (including
cache, NUMA, RAM, etc.) of a single machine.

Ged Byrne

unread,
Oct 13, 2015, 3:48:46 PM
to flow-based-...@googlegroups.com
Hi Paul,

I've been giving this a lot of thought lately.

I don't think that automatic is necessarily the answer.  

What is needed is the ability to abstract away the details of connectivity.

Let's say we have four types of connector: ReadFile, Output, Transmit and Receive.

ReadFile reads characters from a file.  Output prints characters to a console.  Transmit sends characters across a network connection.  Receive receives characters across a network connection.

Functionally the following two graphs are identical:

[a] ReadFile -> Output
[b] ReadFile -> Transmit -> Receive -> Output

The only difference is that in the graph [b] the file and the console might be on separate machines.

Ideally I want to be able to take [a] and easily convert it into [b].  I'd also like to abstract away the network elements of [b] and view it as [a].

It would also be good to have a mechanism for tracking the lifetime of an IP across machines.

At the moment Transmit would have to DELETE its IPs and Receive would have to CREATE new ones.  It would be good to have the ability to SEND an IP over the network.

Regards, 


Ged



Raoul Duke

unread,
Oct 13, 2015, 3:50:33 PM
to flow-based-...@googlegroups.com
> Functionally the following two graphs are identical:
>
> [a] ReadFile -> Output
> [b] ReadFile -> Transmit -> Receive -> Output
>
> The only difference is that in the graph [b] the file and the console might
> be on separate machines.

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

Ged Byrne

unread,
Oct 13, 2015, 4:12:27 PM10/13/15
to flow-based-...@googlegroups.com
Raoul,

I said 'functionally' - Deutsch's fallacies are all concerned with non-functional aspects.

It is because of these non-functional aspects that I would not like to see an automatic approach.  Automation makes the distribution invisible.

Fallacy 5 - the changing topology - is another important aspect.  As the topology changes, we might want to change the position of the Transmit and Receive components while maintaining the functional behavior.

Let's add two more component types: DoSomething and DoSomethingElse.

[a] ReadFile -> DoSomething -> DoSomethingElse -> Output
[b] ReadFile -> Transmit -> Receive -> DoSomething -> DoSomethingElse -> Output
[c] ReadFile -> DoSomething -> Transmit -> Receive -> DoSomethingElse -> Output
[d] ReadFile -> Transmit -> Receive ->  DoSomething -> Transmit -> Receive -> DoSomethingElse -> Transmit -> Receive -> Output

All of these graphs are FUNCTIONALLY equivalent, but their topology is quite different.  By the time we get to [d] the FUNCTIONAL purpose of the graph is very difficult to see.

It seems to me that with FBP it would be possible to
  1. Apply refactoring transformations to graphs that will alter the distribution while preserving functionality.
  2. Prove that two graphs are functionally equivalent.
  3. Provide views at different levels of detail and from different perspectives.

Regards, 


Ged

Raoul Duke

unread,
Oct 13, 2015, 4:32:17 PM10/13/15
to flow-based-...@googlegroups.com
> I said 'functionally' - Deutsh's fallacies are all concerned with
> non-functional aspects.

I always think of the main point of FoDC to be that people try to
separate out the "functionally" stuff and do some hand waving pixie
dust magic RPC whatever, and then when the poop hits the fan they
wonder why. Some claim that Erlang purports to get around this in
various ways, but I think people over-sell it a tad.

Ged Byrne

unread,
Oct 13, 2015, 4:36:15 PM10/13/15
to flow-based-...@googlegroups.com
Separating out the 'functionality' = hand waving pixie dust magic?

Sorry, I'm not buying that one.


Paul Morrison

unread,
Oct 13, 2015, 4:36:37 PM10/13/15
to flow-based-...@googlegroups.com
Hi Ged,

On Tue, Oct 13, 2015 at 3:48 PM, Ged Byrne <ged....@gmail.com> wrote:

I don't think that automatic is necessarily the answer.  

Agree!

 

What is needed is the ability to abstract away the details of connectivity.

Lets say we have four types of connector: ReadFile, Output, Transmit and Recieve.

ReadFile reads characters from a file.  Output prints characters to a console.  Transmit sends characters across a network connection.  Receive receives characters across a network connection.

Functionally the following two graphs are identical:

[a] ReadFile -> Output
[b] ReadFile -> Transmit -> Receive -> Output

The only difference is that in the graph [b] the file and the console might be on separate machines.

Ideally I want to be able to take [a] and easily convert it into [b].  I'd also like to abstract away the network elements of [b] and view it as [a].

It seems to me that all of the above can be covered by a sufficiently smart diagramming tool!

DrawFBP has the ability to "explode" a subnet (shown as a double-line block at the next higher level), so it sounds to me as if we could use a very similar mechanism to "explode" a line!  DrawFBP has a number of line attributes so "connection" attributes could be set on or off, or modified.  The chosen option could then be marked on the line using some little marker, just as dropOldest displays as a zigzag line.

If you "explode the line", it will show as a "send end" process, a "receive end" process, with something in between, e.g. a file or lightning flash icon.  Sounds pretty straightforward to me!  The line attributes could in fact be table-driven, making it easy to add new connect options.

BTW Have you tried out DrawFBP?  It would be good to get feedback from someone with your view of FBP... :-)
 

It would also be good to have a mechanism for tracking the lifetime of an IP across machines.

Agree!  I need to think about that...

At the moment Transmit would have to DELETE it's IPs and Receive would have to CREATE new ones. 

I don't see why that would be a problem, unless you want to transmit IP trees....
 
It would be good to have the ability to SEND an IP over the network.

Hunnh?  I can do that now!  :-)  Actually I can do that using IPs whose contents are Strings - just have to add a JSON (or other format) expander/contracter for non-String formats.  I'm sure that can be obtained off the shelf - except that it probably involves reflection...  Does this raise overhead concerns?
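
In picture form it would be roughly (made-up component names):

A OUT -> IN ToJSON OUT -> IN NetworkSender
NetworkReceiver OUT -> IN FromJSON OUT -> IN B

with the ToJSON/FromJSON pair doing the marshalling at either end.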

Regards,

Paul
 

Raoul Duke

unread,
Oct 13, 2015, 4:40:53 PM10/13/15
to flow-based-...@googlegroups.com
On Tue, Oct 13, 2015 at 1:36 PM, Ged Byrne <ged....@gmail.com> wrote:
> Separating out the 'functionality' = hand waving pixie dust magic?
> Sorry, I'm not buying that one.
> https://en.wikipedia.org/wiki/Separation_of_concerns


It is of course reasonable to try to separate concerns. But all
abstractions are leaky. Some more dramatically than others. That's
all.

Ged Byrne

unread,
Oct 13, 2015, 4:49:11 PM10/13/15
to flow-based-...@googlegroups.com
Hi Paul,

I'm talking about the IP identity spanning multiple machines.

I'd like to know that if I read N IPs from the file then N IPs will be written to the console, regardless of how many machines they travel across to get there.

Regards, 


Ged

Ged Byrne

unread,
Oct 13, 2015, 5:07:44 PM10/13/15
to flow-based-...@googlegroups.com
Agreed.  Personally I've always preferred Box's 'All Models are wrong' to Spolsky's 'All abstractions are leaky.'

"Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."

... and ...

"For such a model there is no need to ask the question 'Is the model true?'. If 'truth' is to be the 'whole truth' the answer must be 'No'. The only question of interest is 'Is the model illuminating and useful?'."

https://en.wikipedia.org/wiki/All_models_are_wrong

Raoul Duke

unread,
Oct 13, 2015, 5:12:27 PM10/13/15
to flow-based-...@googlegroups.com
For me I tend to prefer the leaky one because that is really how it
feels when you start to hit the problems. As if the main sewer line is
leaking. It really emphasizes just how wrong you can go when you try
to think things are separable. BDUF is not dead. etc. in some parallel
universe I wish I knew how to inhabit.

Paul Morrison

unread,
Oct 13, 2015, 9:38:49 PM10/13/15
to flow-based-...@googlegroups.com
See my comments on the waterfall approach in the above-referenced Chap. XV of the 2nd ed. of my book.  Here is a quote from the book:

'By way of comparison, I believe that the chief failing of the “waterfall” methodology was that it was so awkward to go back that development teams would press forward phase after phase, getting deeper and deeper into the swamp, or they would in fact do some redesign, but pretend they were doing development (or even testing!). Here is Dave Olson again: “Programmers know that highly detailed linear development processes don’t match the way real programming is done, but plans and schedules are laid out using the idea, anyway.”'

You don't need a BDUF - you do need to sketch out the design, prototype the bits you are unsure about, and then extend outwards, or from the back end backwards, adding function as needed.  At any time, you can move processes around, re-cleave the overall network, etc.   BDUF is very much an attempt to compensate for the rigidity of the old programming technology.

Regards,

Paul




Ged Byrne

unread,
Oct 14, 2015, 1:23:09 AM10/14/15
to flow-based-...@googlegroups.com
Hi Paul,

This is why I like the Box quote.  The problem isn't big design, it is mediocre design with its mark of "overelaboration and overparameterization".  These are the qualities that bloat the design.

What we need is to develop the "ability to devise simple but evocative models."

What I particularly like is that Box isn't talking about code, he is talking about good science and engineering.  Good coding is good engineering, good science.  

It bothers me that industry has rejected engineering rather than striving to master the practices of good engineering.

Regards, 


Ged

On Wed, 14 Oct 2015 at 02:38 Paul Morrison <jpau...@gmail.com> wrote:
[...]

You don't need a BDUF - you do need to sketch out the design, prototype the bits you are unsure about, and then extend outwards, or from the back end backwards, adding function as needed.  At any time, you can move processes around, re-cleave the overall network, etc.   BDUF is very much an attempt to compensate for the rigidity of the old programming technology.

Regards,

Paul

Paul Morrison

unread,
Oct 14, 2015, 10:57:00 AM10/14/15
to flow-based-...@googlegroups.com
On Wed, Oct 14, 2015 at 1:22 AM, Ged Byrne <ged....@gmail.com> wrote:

It bothers me that industry has rejected engineering rather than striving to master the practices of good engineering.


Do you have examples of that?  I certainly agree that a lot of what is being discussed on this group seems to me to be more like abstract research than engineering, but I am curious if you have more industry-wide examples.  E.g. one example might be the JavaScript community's apparent unwillingness to embrace node-fibers...?  ;-)

Regards,

Paul

Humberto Madeira

unread,
Oct 14, 2015, 11:32:16 AM10/14/15
to Flow Based Programming, co...@mercury.ccil.org
Hi Paul,

On Tuesday, 13 October 2015 13:50:04 UTC-4, Paul Morrison wrote:
Interesting examples, John!   And, as you say, very amenable to FBP.

I understand the overhead of transmitting data across a network, but this is just one of the options available to an FBP application designer.  Conceptually, you build a very high level network, and then, at some point, you decide to spread the network across the available hardware, adding "communication processes" as required.

When I design distributed systems, I prefer to first look at the data that needs to be stored, 
its domain structure, and its classification (if necessary) and break it down into multiple back end servers based on that.

Front end servers then need to be added (in parallel) depending on how many users need to access the system.

I then try to design the flow network on top of that.

For those users that need large computations beyond what would be reasonable on a single front end server, 
I believe it could also be reasonable to offload to one or more generic "intermediate computation" servers.

However, it is pretty rare that there is enough of a pure computation load to need offloading.

One time I tried it, and it turned out to be more reasonable not to offload.
In that particular case, the time cost of transmitting the data exceeded the benefit gained by breaking up the load.

In another case, the offloading worked because the data transmission was kept small.
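
A crude back-of-the-envelope check makes that tradeoff visible (the numbers below are invented purely for illustration):

// Offloading only pays when the compute time saved exceeds the cost of
// shipping the data out and the results back.
class OffloadCheck {
    static boolean worthOffloading(double bytesToShip,
                                   double linkBytesPerSec,
                                   double localComputeSec,
                                   double remoteComputeSec) {
        double transferSec = 2 * (bytesToShip / linkBytesPerSec); // out and back
        return remoteComputeSec + transferSec < localComputeSec;
    }

    public static void main(String[] args) {
        // 500 MB over a 100 MB/s link costs 10 s in transfer alone,
        // so an 8 s local job is better kept local.
        System.out.println(worthOffloading(500e6, 100e6, 8.0, 2.0)); // false
    }
}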

In my book (Chap. XV, 2nd ed.), I call this process "cleaving".   By the way, in FBP this is not an irreversible process, unlike conventional programming. 

Quote from the chapter: "When you think about the ways a flow can be chopped up, remember you can use anything that communicates by means of data, which in data processing means practically any piece of hardware or software."

Of course, you try to minimize the amount of data being transmitted over the slower paths, but IMO this should be the designer's decision.
 
The problem is that we are now in the age of DevOps.  

Either the system should adjust automatically, or the system operators need to be able to influence the configuration based on the available hardware.

Requiring designer intervention to adjust system configuration is not generally a good idea, and it goes against one of the main principles of DevOps.


Bert, Ged, Alfredo, Ken, do all/any of you feel this should be done automatically?   If so, I have to say this goes against my experience.  That's not to say, however, that judicious use of measurement and tracking tools cannot provide useful guidance!
 
Perhaps I may be thinking contrary to the general sentiment, but I do believe that it should be done automatically if at all possible.
Failing that - it should be able to be configured by the DevOps guys if at all possible.

In my own framework, I haven't implemented all of them yet, but I have identified some general heuristics (similar to how databases optimize queries)
that can be applied automatically to flow sequences to reduce overall data transmission.

But that involves using concepts outside of the scope of a purely flow approach.
In a pure flow approach, such as classic FBP, I would probably just leave it up to the designers.

I believe it's really a question of how much a "framework" is able to see and control.
(this is one of the reasons why my own framework is not purely a flow engine)

Regards,
--Bert

Raoul Duke

unread,
Oct 14, 2015, 1:52:03 PM10/14/15
to flow-based-...@googlegroups.com, co...@mercury.ccil.org
> In my own framework, I haven't implemented all of them yet, but I have
> identified some general heuristics (similar to how databases optimize
> queries)
> that can be applied automatically to flow sequences to reduce overall data
> transmission.

Do tell more, if you would be willing to. :-)

Raoul Duke

unread,
Oct 14, 2015, 1:55:26 PM10/14/15
to flow-based-...@googlegroups.com
Sure, I know what BDUF classically means. I also know what "Agile" is
supposed to mean. I also have rarely seen anything work as well as I'd
personally like. Whatever path is chosen ends up throwing out an awful
lot of babies with the bathwater. I purposefully like to use and
reclaim BDUF to rub people's noses in the fact that I believe there's
more nuance to ponder before jumping into writing code. Especially
when the way people write the code is poor, which also seems to be
more true than not. Even the things I have to use from supposedly
hallowed institutions like Google are frankly crappy.

Tom Young

unread,
Oct 14, 2015, 3:10:20 PM10/14/15
to Flow Based Programming
>Conceptually, you build a very high level network, and then, at some point, you decide to spread the network across the available hardware, ...

An FBP run-time engine could conceivably interpret a network
definition, determine the appropriate nodes (hardware), allocate load
sharing components appropriately, and connect the remote and local
nodes dynamically. This is, in fact, what the DFD facilitates.

This approach relieves the designer from becoming an expert at
predicting the run time environment. It also facilitates testing in
a relatively simple environment.
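
As a toy sketch of what that placement step might look like (the notation and names here are invented, not any existing engine's API):

import java.util.*;

// Given a process-to-node assignment, keep local connections as-is and splice
// a sender/receiver pair into any connection that crosses a machine boundary.
class Distributor {
    record Connection(String fromProc, String toProc) {}

    static List<String> plan(List<Connection> graph, Map<String, String> nodeOf) {
        List<String> wiring = new ArrayList<>();
        int port = 9000;                                  // assumed starting port
        for (Connection c : graph) {
            String a = nodeOf.get(c.fromProc()), b = nodeOf.get(c.toProc());
            if (a.equals(b)) {
                wiring.add(c.fromProc() + " OUT -> IN " + c.toProc());
            } else {
                wiring.add(c.fromProc() + " OUT -> IN NetSender(" + b + ":" + port + ")");
                wiring.add("NetReceiver(:" + port + ") OUT -> IN " + c.toProc());
                port++;
            }
        }
        return wiring;
    }

    public static void main(String[] args) {
        List<Connection> g = List.of(new Connection("FileReader", "Transform"),
                                     new Connection("Transform", "ConsoleWriter"));
        Map<String, String> nodes = Map.of("FileReader", "hostA",
                                           "Transform", "hostA",
                                           "ConsoleWriter", "hostB");
        plan(g, nodes).forEach(System.out::println);
    }
}

The designer only ever writes the undistributed form; the engine (or the configuration feeding it) owns the node map.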

---twy

Ged Byrne

unread,
Oct 15, 2015, 4:38:43 AM10/15/15
to flow-based-...@googlegroups.com
Hi Paul,

I see two key gaps.
Regarding the first, it's interesting that you say that the discussions sound like abstract research, because I'll bet that most of those involved are practitioners.

A good example is the current move away from RDBMS to NoSQL solutions.  Relational Databases have proven themselves in production for decades, and they are supported by a solid body of knowledge based on serious research.  Yet the NoSQL community rejects all of this to pursue new technologies.

NoSQL has come from areas where the RDBMS does not fit, and discovering new solutions is sensible, but in my day job I face developers and architects building traditional business applications that the RDBMS is ideal for but they are pursuing NoSQL solutions just because they are new and SQL is old.

The classic, proven knowledge is being rejected in order to pursue fads that are:
  • Simple (No need to design or normalise)
  • Prescriptive (SQL bad, NoSQL good)
  • Falsely encouraging (We can build an app in a day)
  • One size fits all (We need to use NoSQL, now what are your requirements?)
  • Easy to cut-and-paste (... in 5 minutes tutorials)
  • In tune with the zeitgeist (NoSQL is more Agile, DBAs are so Waterfall!)
  • Novel, not radical (Lots of shiny new technologies.  No new ideas, just PIC or BerkeleyDB updated for the Internet age)
  • Legitimised by gurus and disciples (Facebook and Google are doing it, so should we.  It doesn't matter that we are in neither the social media or search businesses)

I'm glad to say that this group contains free thinkers rather than fad chasers.  I wish the same was true for the day job.

Here's a blog post I wrote about this a few years ago: https://softwareflow.wordpress.com/2012/05/12/facts-not-opinions/

Regards,


Ged

Ged Byrne

unread,
Oct 15, 2015, 7:28:49 AM10/15/15
to flow-based-...@googlegroups.com
I should qualify - when I say 'day job' I mean the industry.  I'm glad to say that I do get to work with some smart people in my current role.

Humberto Madeira

unread,
Oct 15, 2015, 11:00:29 AM10/15/15
to Flow Based Programming, co...@mercury.ccil.org
Hi Raoul,
OK.  I'll bite.  Here's one of the database-like ones..

So in my framework, the "data IPs" are restricted to immutable tuples (TupleValues) very similar to a row of data in a row set of some kind.  
Each TupleValue conforms to a specific Model, which not only describes the structure of the tuple, but also provides a factory for creating it, and task support for dealing with storage queries.

Models can have simple data storage (like tables) or they can represent a Join of two or more other Models.

When a join-type Model is queried, the brute force approach (which works in all cases) is to just execute a query on each of the underlying Models,
then use flow operations to join and filter the results according to the join definition and the specific query.

However this brute force approach will result in a lot of data flowing, just to toss it away again.
(to minimize flow, the ideal rule is to always filter closest to the data source)

So there are specific heuristics that can be applied in certain circumstances

1) the other Models belong to the same DataStore (which also supports joins)
In this case, the incoming query will be converted so that a join query will be pushed directly into the underlying DataStore.
This leaves it up to the underlying storage engine to resolve, but its output will be reduced to a single stream filtered by what's in the query 
(so far this is mostly just like any normal SQL program using automatically generated join queries)

2) the other models belong to different DataStore (or one which does not support joins)
In this case, a flow network is created such that the incoming query is broken up into pieces that each get sent to one of the underlying DataStores.
The separate queries are made against each DataStore with resulting flows that get merged in a join Operation to produce a final output flow (this is an in-memory merge, with filtering at the DataStore).
This also minimizes the data flow from the underlying DataStores, but as multiple streams instead of just one.

3) one of the incoming query components can't be translated into the query sent to the underlying DataStore
In this case, the resultant flow network will need to pull slightly more data from the underlying DataStore(s) 
but there will be a filter operation added to the output of the resultant flow just for the purpose of filtering out the non-translatable query component. 
This also minimizes the data flow from the underlying DataStores to whatever extent it can, 
but also acts as a way to enhance the filtering capability against that particular model in ways that the underlying DataStore could not.

The upshot of all of these adaptive heuristics is that they can be deduced purely by examining the Model structure definitions,
BUT it requires that the framework be sufficiently aware of that Model structure in such a way as to be able to automatically create the correct adaptive flow network for each Model.

Without the ability to go through the metadata programmatically, all of this would have to be done manually.
Which is OK for experienced programmers (when they think of it and are given time to implement it), but less so for people just starting out.
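
If it helps, the shape of the decision boils down to something like this toy planner (the types below are illustrative stand-ins, not my actual classes):

import java.util.*;

// Pick heuristic 1 when both Models live in one join-capable DataStore,
// heuristic 2 otherwise, and bolt on heuristic 3 whenever part of the query
// cannot be translated for the underlying store.
class JoinPlanner {
    record Model(String name, String dataStore, boolean storeSupportsJoins) {}
    record Query(boolean fullyTranslatable) {}

    static String plan(Model left, Model right, Query q) {
        StringBuilder p = new StringBuilder();
        boolean sameStore = left.dataStore().equals(right.dataStore());
        if (sameStore && left.storeSupportsJoins()) {
            p.append("PUSH-DOWN JOIN in ").append(left.dataStore());     // heuristic 1
        } else {
            p.append("QUERY ").append(left.name()).append(" | QUERY ")   // heuristic 2
             .append(right.name()).append(" -> MERGE-JOIN");
        }
        if (!q.fullyTranslatable()) {
            p.append(" -> RESIDUAL-FILTER");                             // heuristic 3
        }
        return p.toString();
    }

    public static void main(String[] args) {
        Model orders = new Model("Orders", "storeA", true);
        Model customers = new Model("Customers", "storeB", true);
        System.out.println(plan(orders, customers, new Query(false)));
        // QUERY Orders | QUERY Customers -> MERGE-JOIN -> RESIDUAL-FILTER
    }
}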

I hope this is enough of a taste to clarify what I meant.

BTW, there are plenty of books on database heuristics out there.  I'm sure at least some of them would be applicable.

Regards,
--Bert

Raoul Duke

unread,
Oct 15, 2015, 1:13:47 PM10/15/15
to flow-based-...@googlegroups.com
word +1 you are hired!

:-( that the world is like this.
:-) that there are places where people aren't like this.

Raoul Duke

unread,
Oct 15, 2015, 1:14:34 PM10/15/15
to flow-based-...@googlegroups.com
Wow, thanks. Trying! To! Digest! ...

Paul Morrison

unread,
Oct 15, 2015, 1:16:52 PM10/15/15
to flow-based-...@googlegroups.com
Hi Ged, your quote from Jeff Atwood, "Software development is only like bridge building if you’re building a bridge on the planet Jupiter, out of newly invented materials, using construction equipment that didn’t exist five years ago" brought a tear to my eye - well, almost!  Because building apps using FBP is precisely the opposite: you're building a bridge in the here and now, using a technology that has been around for 40+ years, with a lot of useful components available, with several decades of accumulated experience behind it, and a proven track record of building reliable and maintainable systems!

A lot of the gurus seem to want to prove that FBP is inappropriate for GUIs (although the jury is still out) - and maybe it is, but why not use it for all the stuff that it is well-adapted for - and contribute more reusable components to the store, while you are at it?!

Regards,

Paul

Ged Byrne

unread,
Oct 15, 2015, 1:23:41 PM10/15/15
to flow-based-...@googlegroups.com, co...@mercury.ccil.org
Hi Bert,

This sounds amazing.  I hope we get to learn more about it.

Regards, 


Ged


Ged Byrne

unread,
Oct 15, 2015, 1:27:18 PM10/15/15
to flow-based-...@googlegroups.com, co...@mercury.ccil.org
Regarding doing it automatically....

I don't think that the distribution should be hidden from the flow graph, tucked away in the engine.  It should be a component within the graph.  There should be enough metadata to filter it out if desirable, but it should be fully accessible.

Automatically generating the flow graph, though, is a different thing entirely.  I think that approach has a lot of promise.  As long as I can see the full graph, just as I do when viewing an SQL execution plan.

How far is that from the approach you are taking?

Regards, 


Ged

John Cowan

unread,
Oct 15, 2015, 1:53:22 PM10/15/15
to Ged Byrne, flow-based-...@googlegroups.com
Ged Byrne scripsit:

> I don't think that the distribution should be hidden within the flow graph,
> hidden away in the engine. It should be a component within the graph.

That's a framework-specific choice. Some frameworks may only support
distribution (max one process per CPU), as when using lots of itty-bitty CPUs.
In that case, requiring send/receive components throughout the graph
would be nothing but noise.

--
weirdo: When is R7RS coming out?
Riastradh: As soon as the top is a beautiful golden brown and if you
stick a toothpick in it, the toothpick comes out dry.

Humberto Madeira

unread,
Oct 15, 2015, 10:20:00 PM10/15/15
to Flow Based Programming, co...@mercury.ccil.org

Hi Ged,

On Thursday, 15 October 2015 13:27:18 UTC-4, Ged Byrne wrote:
Regarding doing it automatically....

I don't think that the distribution should be hidden within the flow graph, hidden away in the engine.  It should be a component within the graph.  There should be enough metadata to filter it out if desirable, but it should be fully accessible.

Automatically generating the flow graph based, that's a different thing entirely.  I think that approach has a lot of promise. 

I'd certainly like to believe so :-) 
 
As long as I can see the full graph, just as I do when viewing an SQL execution plan.

How far is that from the approach you are taking?

A generated adapter network is the same as any other flow graph at run time.
At design time, it looks like a simple component.  Comparing the two implies drilling down from one into the other (an expansion).

Other heuristics might show a logical network at design time, but morph to a different network at run time.
The comparison might not be as simple as expansion.

What is important to the designer (what he/she has to maintain) is the logical network, not the "physical" rendering.

If the heuristic generation process was itself adjustable, you might potentially have many physical implementations of the same logical network.
None of them very interesting to the designer (sort of the way compilers perform code optimization)

The DevOps people however (not to mention the QA's), would potentially be more interested in the physical network.

However, at present, I am not sure how much latitude they should have over the physical flow network.
I have a certain leaning, but I am interested in hearing other opinions.

And all that before even considering the issue of security and how it should impact visibility...

Regards,
--Bert


Ged Byrne

unread,
Oct 16, 2015, 1:33:21 AM10/16/15
to Flow Based Programming, co...@mercury.ccil.org
Fascinating stuff.

I hope you can share more in the future.

Humberto Madeira

unread,
Oct 16, 2015, 1:05:33 PM10/16/15
to Flow Based Programming, kuna...@gmail.com, co...@mercury.ccil.org
Hi John,

My apologies for not getting back to this quicker - my week has been rather busy.

I took some time to look up some information on Hadoop - my last look at it was getting rather old.

On Monday, 12 October 2015 19:02:39 UTC-4, John Cowan wrote:
Humberto Madeira scripsit:

> The fact is, passing 1.1 billion records back and forth is a performance
> killer.

That's like saying "Spending a billion dollars is a profit killer."  It
depends on what that billion dollars buys you.  If what you want is a
heavy-lift rocket to the Moon, it's cheap.
 
Actually, the SpaceX Falcon Heavy is projected to cost $125 million per moon shot - it's cheaper.
There's even an example in the article about shipping fuel to orbit for profit - a billion dollars per shot would make it very unprofitable.

If you're thinking of spending a billion dollars, it really helps if you shop around a bit first.
 

> Hadoop can be one way to perform the distributed query and aggregate the
> results, but Hadoop is not FBP.

Hadoop is indeed FBP, a particular instance of an FBP network.



I believe your first example with the 1.1 billion records is in the Wikipedia article on MapReduce.
However, if you read the article, it is merely given as a thought example/exercise

"As another example, imagine that for a database of 1.1 billion people"

So this was never actually attempted - we have no idea of what the performance would actually be.

For your second example,  I believe it is mentioned here.  And it did use Hadoop.
But the motivation seemed to be more to "just try it" than any sort of cost or time calculation.

"Pretty quickly I thought about how we could do this (and have some fun along the way)"

And there was no comparison with any possible alternatives.

But regardless of that, if you go look at the Wikipedia article on MapReduce, you will not find one mention of FBP.

The article does allude to some data streaming (Input Reader and Output Writer), but AFAIK, 
simple data streaming (at least, based on the rough agreement in this forum) is not considered to be FBP.

Perhaps someone else reading this might confirm or deny that opinion?

Looking at Apache's page on Hadoop, however, we do see that it contains two components, Spark and Tez, that both seem to be some sort of FBP.
(I don't have the time to fully confirm that, so I will leave it as an "exercise to the interested reader")

But here's the thing.  The earliest builds of either Tez or Spark only happened in 2014

The 2nd example happened in 2007 and the 1st (thought) example was only added to the MapReduce article in 2013.

Neither of them could have been referring to either Tez or Spark.

So, no FBP in either of the two Hadoop examples.

BTW, as an aside for fun, you may also want to read this article on the "post-Hadoop era"

 Best Regards,
--Bert

John Cowan

unread,
Oct 16, 2015, 1:31:01 PM10/16/15
to Humberto Madeira, Flow Based Programming
Humberto Madeira scripsit:

> Actually, the SpaceX Falcon heavy is projected to cost $125 million per
> moon shot <http://www.cislunarone.com> - it's cheaper.

I was actually talking about building such a rocket from scratch, not
per shot. The Saturn V was much more expensive than that!

> But regardless of that, if you go look at the Wikipedia article on
> MapReduce, you will not find one mention of FBP.

There are lots of things that are instances of FBP even though people
don't realize it.

Architecturally, MapReduce partitions input data into substreams (often
implicitly by where it is stored), filters each IP in the stream,
sends the IPs intelligently to reduce processes, which then collapse
them into single (or few) results. That is obviously representable as
an FBP network with mapper and reducer components.
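
As a throwaway illustration (ordinary Java streams standing in for the network here, nothing to do with Hadoop's actual API), word count has exactly that shape:

import java.util.*;
import java.util.stream.*;

// Split the input into substreams of word "IPs" (the mapper), route them by
// key (the intelligent send), and collapse each key's stream to a count (the
// reducer).
class MapReduceAsFlow {
    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do is to be");

        Map<String, Long> counts = lines.stream()
            .flatMap(l -> Arrays.stream(l.split(" ")))   // mapper component
            .collect(Collectors.groupingBy(
                w -> w,                                  // partition by key
                Collectors.counting()));                 // reducer component

        System.out.println(counts);   // e.g. {not=1, be=3, or=1, to=4, do=1, is=1}
    }
}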
--
No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friends or of thine own were: any man's death diminishes me,
because I am involved in mankind, and therefore never send to know for
whom the bell tolls; it tolls for thee. --John Donne