Large amount of shared immutable data.


toivo...@gmail.com

Jun 11, 2016, 6:49:27 AM
to Flow Based Programming

The usual FBP approach is to exchange data between components.

The data can be anything, from a few bytes to an enormous amount of data.


But sometimes not all of the data is needed; some references to the data are sufficient.


For example, we might have a large database.

The first component does an initial scan and finds a range of records.

The component's output is the start and end references of the range.

The next component does something with the records in the range, but not all records in the range are actually needed.

So it's not wise to read all the records in the first component and send them to the second component.

It might not even be possible; the amount of data might be too large.


I think using references instead of actual data is OK.

Of course, this works only when the data is read-only and doesn't change during processing.
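
A minimal sketch of the idea described above, assuming a hypothetical in-memory `DATABASE` dictionary that stands in for the real database. The component names `scan_component` and `process_component` are made up for illustration; the point is only that the range boundaries travel between components, not the records.

```python
# Hypothetical read-only "database": record key -> record.
DATABASE = {i: {"id": i, "value": i * i} for i in range(1_000)}


def scan_component(low, high):
    """First component: emits only the range boundaries, not the records."""
    return (low, high)


def process_component(record_range, wanted):
    """Second component: reads only the records it actually needs."""
    start, end = record_range
    results = []
    for key in range(start, end + 1):
        if wanted(key):  # decide from the key alone, without loading the record
            results.append(DATABASE[key]["value"])  # read-only access
    return results


range_ref = scan_component(10, 20)
values = process_component(range_ref, wanted=lambda key: key % 5 == 0)
```

Only three of the eleven records in the range are ever read here, and nothing larger than a pair of integers is passed between the two components.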


I have a feeling that sharing data this way is not FBP style, or is even bad practice?


What do you think?

Are there better ways?


Thanks

Toivo

Alexander Harchenko

Jun 11, 2016, 8:21:49 AM
to Flow Based Programming
My opinion:
It all depends on the planned application architecture. Assuming that the function of every block in the diagram is written by the end user, everything depends on the data structure. If you work with an image, transmit the path to the image. If you work with database records, it is better to transmit the records' unique identifiers between blocks. In this case, each block can load the data directly from the database.
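
As a rough illustration of passing identifiers rather than records, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The `records` table, its columns, and the function names are assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)


def select_ids(connection):
    """Upstream block: emits only the identifiers of the matching records."""
    rows = connection.execute("SELECT id FROM records WHERE id >= 2 ORDER BY id")
    return [row[0] for row in rows]


def load_payload(connection, record_id):
    """Downstream block: loads the full record only when it needs it."""
    row = connection.execute(
        "SELECT payload FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    return row[0]


ids = select_ids(conn)
payloads = [load_payload(conn, record_id) for record_id in ids]
```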

Alfredo Sistema

Jun 11, 2016, 9:43:43 AM
to Flow Based Programming
Classical FBP seems to favor this, passing around a reference/pointer and not a full copy per send; otherwise you'd need a very large amount of memory and time spent making copies, even for trivial graphs.
If you have special requirements you can also pass around a key to access shared data in some resource, for example a database or a session, so you don't need to pass around everything or deal with changes, since you can always ask for the freshest data at the start of your operations.


Paul Morrison

Jun 11, 2016, 10:41:13 AM
to flow-based-...@googlegroups.com
Hi Toivo, maybe I misunderstood, but what you describe seems to me totally under the control of the application designer.  Classical FBP doesn't seem to me to constrain how data is passed from one process to another.

Within a machine, most data transfers are technically by reference, which is why classical FBP has ownership rules for data - an IP can only be owned by one process at a time, or be in transit on a connection, and a process definitely should not modify data that it does not own, even if it somehow gets hold of the reference.  At deactivation, an FBP  process must have positively disposed of all owned IPs, which is why there is an argument for deactivating a process more frequently, rather than less.
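
The ownership rules above could be sketched roughly as follows. This is not the API of any FBP implementation; `Packet`, `Process`, and `OwnershipError` are hypothetical names, and a plain list stands in for a connection.

```python
class OwnershipError(RuntimeError):
    """Raised when a process touches an IP it does not own."""


class Packet:
    """An information packet (IP) with at most one owner at a time."""

    def __init__(self, data, owner):
        self.data = data
        self.owner = owner  # None means "in transit" or disposed of


class Process:
    def __init__(self, name):
        self.name = name
        self.owned = set()

    def create(self, data):
        packet = Packet(data, owner=self)
        self.owned.add(packet)
        return packet

    def send(self, packet, connection):
        self._check_owner(packet)
        self.owned.discard(packet)
        packet.owner = None  # now in transit on the connection
        connection.append(packet)

    def receive(self, connection):
        packet = connection.pop(0)
        packet.owner = self
        self.owned.add(packet)
        return packet

    def drop(self, packet):
        self._check_owner(packet)
        self.owned.discard(packet)
        packet.owner = None

    def modify(self, packet, data):
        self._check_owner(packet)  # holding a reference is not enough
        packet.data = data

    def deactivate(self):
        if self.owned:
            raise OwnershipError(f"{self.name} still owns {len(self.owned)} IP(s)")

    def _check_owner(self, packet):
        if packet.owner is not self:
            raise OwnershipError(f"{self.name} does not own this IP")


connection = []
producer, consumer = Process("producer"), Process("consumer")
ip = producer.create({"key": 42})
producer.send(ip, connection)
producer.deactivate()  # fine: the producer owns nothing any more
received = consumer.receive(connection)
consumer.modify(received, {"key": 43})
```

In this sketch the producer still holds the Python reference `ip` after sending, but any attempt to modify it through `producer.modify` raises `OwnershipError`, and `consumer.deactivate()` would fail until the consumer sends or drops the packet.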

Between machines, I would serialize the data using some convention, e.g. JSON, protocol buffers, etc. - or pass URLs addressing data out on the cloud. 
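
For the between-machines case, a minimal JSON round trip might look like this. The packet fields are invented for the example, and the transport itself is left out.

```python
import json

# Hypothetical packet contents; the field names are illustrative only.
packet = {"table": "records", "key": 42, "range": [100, 200]}

wire_bytes = json.dumps(packet).encode("utf-8")    # sending side
restored = json.loads(wire_bytes.decode("utf-8"))  # receiving side
```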

Hope I haven't confused you more...

Paul M.


Paul Morrison

Jun 11, 2016, 10:43:15 AM
to flow-based-...@googlegroups.com
Hi Alfredo and Alexander, I think we are all saying the same thing!  :-)

toivo...@gmail.com

Jun 11, 2016, 11:22:26 AM
to Flow Based Programming
Thank you all for your feedback.

I misused the word "reference".
Data transfers are often technically by reference, as Paul said.

What I wanted to say is that a component will not receive all the data (be it by reference, direct data, copy, etc.).
Instead, only some pointers to the data will be sent between components. (Let's use "pointer" instead of "reference".)

In the case of a database, the pointers might be primary keys.
Or, in the case of a large list, the pointers might be element indexes.

As I understand it, FBP does not prevent such an approach.

But I am a little concerned about how a shared data source fits into the picture.

For example, in the case of a database, we have somewhere a database component which gives access to the data in the database. And the other components must have access to the database component.
Does the flow need to include this shared database component?

The shared database component is some sort of shared resource.
And hundreds of flow components may need access to it.

Possibly we don't describe such a shared resource in the flow at all.
Each component just calls the shared resource directly, internally.
Hidden, invisible calls in the flow?

Or can we describe shared resources in the flow somehow?

Hopefully I described my concerns better this time.

Thanks
toivo

Alfredo Sistema

Jun 11, 2016, 11:53:42 AM
to Flow Based Programming
I believe that a process should have a single responsibility, or maybe just a few: for example, a process that returns the result of a premade query, or a process that updates the database based on incoming packets. Don't make a process that creates a connection and then passes it around the graph; that is not very convenient. You might as well pipe the updates to a process that has that responsibility. The resource could be a database, a data structure, or a display; it doesn't matter.
What you could pass around is a token, like a session id or some identification so that updates can be pointed to the right resource.
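
A small sketch of that arrangement, assuming threads as processes and a `queue.Queue` as the connection. The `sessions` dictionary is a hypothetical stand-in for the real resource, and the session ids play the role of the token.

```python
import queue
import threading

# Hypothetical stand-in for the real resource (database, display, ...).
sessions = {"s1": [], "s2": []}
updates = queue.Queue()
STOP = object()  # sentinel telling the updater to shut down


def updater():
    """The only process that touches `sessions`."""
    while True:
        item = updates.get()
        if item is STOP:
            break
        session_id, value = item
        sessions[session_id].append(value)


worker = threading.Thread(target=updater)
worker.start()

# Other processes send (token, update) pairs instead of holding a connection.
updates.put(("s1", "login"))
updates.put(("s2", "login"))
updates.put(("s1", "logout"))
updates.put(STOP)
worker.join()
```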

Humberto Madeira

Jun 11, 2016, 12:58:40 PM
to Flow Based Programming
Hi toivo,

The problem with external references, especially those that are shared, is that you will have to involve some sort of locking mechanism.

For database objects (rows), the fact that you are going through some sort of database connection with transactional support will tend to insulate you somewhat.

The main problem, however, is that some databases escalate row locking to table locking, and once that happens, your performance will drop dramatically.
(I am assuming that each process will be using and maintaining its own independent connection)

The other problem is that some connections are big memory-consuming things.  
Some database engines have a definite limit on number of active connections (a limited connection pool) so you may have to wait for one to become available.
Performance will drop even further.

If you want to work in this way, you may need to look at a non-traditional database (but even they will have performance issues)

If your external reference is a file system, and the referenced objects are files, you will run into file locking issues.

If the external reference is in a cache, and it has its own locking support, it will run a bit slower than without, but you won't be too badly off.
In-memory references without some kind of locking are a recipe for disaster.

Another issue with external references is that you can create multiple independent (equivalent) references to the same object at the same time.
There is no concept of local ownership possible with external objects: the objects are owned externally, so it is the external owner that manages conflicts (if it can).

None of these issues are impossible to manage.  But they require the programmer to be aware of them and to stay on top of things.

Regards,
--Bert

[...] object-flow approach that I use.

John Cowan

Jun 11, 2016, 9:08:03 PM
to flow-based-...@googlegroups.com
Humberto Madeira scripsit:

> The main problem however is that some databases escalate row locking to
> table locking, and once that happens, your performance will drop
> dramatically.

If the ids passed around are used only for reading, that's usually not
a big deal. FBP philosophy would be for having only a single
process per table to do writing, just as you don't want multiple
processes writing to a single file.

--
John Cowan http://www.ccil.org/~cowan co...@ccil.org
Knowledge studies others / Wisdom is self-known;
Muscle masters brothers / Self-mastery is bone;
Content need never borrow / Ambition wanders blind;
Vitality cleaves to the marrow / Leaving death behind. --Tao 33 (Bynner)

Humberto Madeira

Jun 12, 2016, 1:09:05 PM
to Flow Based Programming, co...@mercury.ccil.org
Hi John,

But we weren't discussing classic FBP.

We were discussing a proposed variation from toivo that didn't intend to follow the usual philosophy of FBP,
and I was pointing out where that might lead to technical problems.

Regards,
--Bert

Raoul Duke

Jun 12, 2016, 2:24:10 PM
to flow-based-...@googlegroups.com, co...@mercury.ccil.org

See things like CRDTs, perhaps.

Humberto Madeira

Jun 13, 2016, 9:15:17 AM
to Flow Based Programming, co...@mercury.ccil.org
Hi Raoul,

On Sunday, 12 June 2016 14:24:10 UTC-4, raould wrote:
> See things like CRDTs, perhaps.


CRDTs look pretty interesting, but from what I gathered from a light skim, they seem to work best with multiple data stores or, at least, multiple tables: something that avoids the table locking issue.

Could be worth a try.  I would think that it would depend a lot on what the application needed to do.

Regards,
--Bert

toivo...@gmail.com

Jun 13, 2016, 10:13:15 AM
to Flow Based Programming

Thank you all for your feedback.

It's much clearer now.

And I received a lot of valuable information.

I try to avoid any modification of data obtained through the shared resource.
That means the data is strictly read-only, and it must not change while the FBP flow is processing it.

So, no locking, no transactions, etc.
Processes (components) will only read data.

At the moment, the open question for me is:
Where do I put the shared resource in the flow diagram?

Definition:
“FBP defines applications as networks of "black box" processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. FBP is thus naturally component-oriented.”

The shared resource is not an ordinary process.
And it's not connected to other processes by message passing in the usual way.
At the same time, the shared resource is part of the processing, part of the network (flow).

Thanks
Toivo

Alexander Harchenko

Jun 13, 2016, 5:05:47 PM
to Flow Based Programming

Paul Morrison

Jun 13, 2016, 10:08:29 PM
to Flow Based Programming
Sorry guys, I still don't get it.  FBP networks show processes communicating via data flows passing over bounded buffers. 

Let's say you have a database which you want multiple processes to be able to modify - then this is just like conventional database processing, which we have been doing for decades - see Chap. 19 in my book (2nd ed.) - Synchronization and Checkpoints.  You might find Mike Beckerle's description of Ohua interesting.

Another approach is to put all accesses to this database in one process, which receives read and write requests and processes them sequentially.  With caches and multiplexing, this can perform very well - in fact Facebook's Flux architecture seems to do this (based on their diagram). 
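
One way to picture the single-process approach is the sketch below. It assumes threads and `queue.Queue` objects in place of FBP processes and connections, leaves out the caching and multiplexing mentioned above, and uses a plain dictionary as a hypothetical store. The request tuple format and the `call` helper are made up for the example.

```python
import queue
import threading

requests = queue.Queue()
STOP = object()  # sentinel telling the database process to shut down


def database_process():
    """Owns the store; handles one request at a time, in arrival order."""
    store = {}
    while True:
        request = requests.get()
        if request is STOP:
            break
        op, key, value, reply = request
        if op == "write":
            store[key] = value
            reply.put(True)
        elif op == "read":
            reply.put(store.get(key))
        else:
            reply.put(None)  # unknown operation: answer anyway so callers never hang


def call(op, key, value=None):
    """Helper for client processes: send a request and wait for the reply."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, key, value, reply))
    return reply.get()


server = threading.Thread(target=database_process)
server.start()
call("write", "a", 1)
result = call("read", "a")
missing = call("read", "b")
requests.put(STOP)
server.join()
```

Because only `database_process` ever touches `store`, no lock is needed; the request queue serializes reads and writes.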

Another technique we used, particularly for tables, was to have one process build the table in the application initialization section, and then put its address in a named global; other processes then access the table in read-only mode.  In our application, we had quite a few tables, which were all preloaded from individual databases.

So I see no need for the data as such to be shown in the diagram, although the processes managing it will be.

Hope this helps

John Cowan

Jun 14, 2016, 5:32:46 PM
to flow-based-...@googlegroups.com
Paul Morrison scripsit:

> Another approach is to put all accesses to this database in one process,

The idea I was suggesting was putting all write accesses in one process,
but allowing all processes to read as needed. So the IP contains a key
which the process can apply to the database if it needs the full data.
Because all processes but one are read-only, there is no significant
concurrency problem.
Raffiniert ist der Herrgott, aber boshaft ist er nicht.
--Albert Einstein

Paul Morrison

Jun 14, 2016, 10:01:43 PM
to flow-based-...@googlegroups.com
Wouldn't you have to block the read accesses to make sure they don't overlap with the write? But of course the reading processes can overlap with each other.  And give the write process some priority... 
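
A sketch of the scheme described here, built on `threading.Condition`: readers may overlap with each other, a writer excludes everyone, and waiting writers are let in ahead of new readers. The class name `ReadWriteLock` and its methods are assumptions for the example, not part of the standard library.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers, one writer; waiting writers go first."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_active = False
        self._writers_waiting = 0

    def acquire_read(self):
        with self._cond:
            # New readers hold back while a writer is active or waiting.
            while self._writer_active or self._writers_waiting:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            while self._writer_active or self._readers:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()
```

Giving writers priority this way can starve readers if writes arrive continuously, so it suits the case discussed here, where writes are rare and reads are the norm.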

John Cowan

Jun 14, 2016, 10:03:43 PM
to flow-based-...@googlegroups.com
Paul Morrison scripsit:

> Wouldn't you have to block the read accesses to make sure they don't
> overlap with the write? But of course the reading processes can overlap
> with each other. And give the write process some priority...

Just so. Multiple readers, single writer is much easier to get right
than multiple readers and writers.
I now introduce Professor Smullyan, who will prove to you that either
he doesn't exist or you don't exist, but you won't know which.
--Melvin Fitting