Exception disconnected data flows


m.neuma...@gmail.com

Jun 1, 2014, 6:15:51 PM
to stratosp...@googlegroups.com
Hej,

I'm implementing a small library of graph analysis tools. I wrote a wrapper for it: the idea is that you specify which analyses you want to run and on which graph, and everything is done in a single job.

Some parts share a common pre-processing step; everything else just shares the initial DataSets.

When I try to run it I get the following exception (interestingly after the first sub job has finished and produced an output):

Exception in thread "main" eu.stratosphere.compiler.CompilerException: The given program contains multiple disconnected data flows.

Questions:
What does the exception mean in detail? All sub-jobs share at least the input DataSets, so in my mind they are connected.
Is the structure I tried to build illegal? If so, why, and what would be a good alternative that provides the same functionality?


cheers Martin

Stephan Ewen

Jun 1, 2014, 6:32:23 PM
to stratosp...@googlegroups.com
Hey!

This is an old check that prevents two programs with no overlap at all from being submitted as one batch. I think we could actually allow that with a minor change in the Scheduler, but the current version relies on the plan being connected.
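
To illustrate what "connected" means here, the following is a toy model in plain Java, not the actual compiler code. The class `PlanConnectivity` and its methods are made up for this example. It treats the plan as an undirected graph of operators and counts the connected components with a union-find structure; a plan passes if there is at most one component.

```java
// Toy model of a dataflow plan: operators are numbered 0..n-1 and each edge
// links a producer to a consumer. Hypothetical names, NOT Stratosphere code.
final class PlanConnectivity {

    // Union-find "find" with path halving.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Counts the connected components, ignoring edge direction.
    static int countComponents(int nodeCount, int[][] edges) {
        int[] parent = new int[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            parent[i] = i;
        }
        int components = nodeCount;
        for (int[] e : edges) {
            int a = find(parent, e[0]);
            int b = find(parent, e[1]);
            if (a != b) {
                parent[a] = b;   // merge the two components
                components--;
            }
        }
        return components;
    }

    // A plan is "connected" if all operators end up in one component.
    static boolean isConnected(int nodeCount, int[][] edges) {
        return countComponents(nodeCount, edges) <= 1;
    }

    public static void main(String[] args) {
        // 0 = source, 1 = map, 2 = sink A, 3 = filter, 4 = sink B.
        // Both sinks hang off the same source, so the plan is connected.
        int[][] sharedSource = {{0, 1}, {1, 2}, {0, 3}, {3, 4}};
        // Two pipelines with nothing in common: 0 -> 1 and 2 -> 3.
        int[][] disjoint = {{0, 1}, {2, 3}};
        System.out.println(isConnected(5, sharedSource)); // true
        System.out.println(isConnected(4, disjoint));     // false
    }
}
```

Under this model, sub-jobs that share an input DataSet should count as connected, which matches the expectation in the original question.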

Can you paste your program (or the essence of it) so we can have a look?

BTW: Currently, no caching happens between jobs, so if you do the following:

DataSet<...> input = env.readTextFile("...");
input.map(...).reduce(...).print();
env.execute();
input.filter(...).flatMap(...).print();
env.execute();

then the input will be read twice.
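
The effect can be mimicked with a small plain-Java model. It does not use the Stratosphere API, and `RereadDemo` and its methods are hypothetical names. A lazy source stands in for `env.readTextFile(...)` and a counter records how often it is actually read. The model assumes that within a single execution the source is read once and its output is shared between sinks, which this thread implies but does not state outright.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model of lazy execution without caching. Hypothetical names only.
final class RereadDemo {
    // Counts how often the "file" is actually read.
    static int reads = 0;

    // Stands in for env.readTextFile(...): nothing is read until get() is called.
    static Supplier<List<Integer>> source() {
        return () -> {
            reads++;
            return Arrays.asList(1, 2, 3, 4, 5);
        };
    }

    // Two separate executions: each one pulls from the source again.
    static int readsWithTwoExecutions() {
        reads = 0;
        Supplier<List<Integer>> input = source();
        int sum = input.get().stream().mapToInt(Integer::intValue).sum();   // first execute()
        long evens = input.get().stream().filter(x -> x % 2 == 0).count();  // second execute()
        System.out.println("sum=" + sum + ", evens=" + evens);
        return reads;
    }

    // One execution feeding two sinks: the source is pulled once and shared.
    static int readsWithOneExecution() {
        reads = 0;
        List<Integer> data = source().get();
        int sum = data.stream().mapToInt(Integer::intValue).sum();
        long evens = data.stream().filter(x -> x % 2 == 0).count();
        System.out.println("sum=" + sum + ", evens=" + evens);
        return reads;
    }

    public static void main(String[] args) {
        System.out.println(readsWithTwoExecutions()); // 2
        System.out.println(readsWithOneExecution());  // 1
    }
}
```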

Stephan

Martin Neumann

Jun 1, 2014, 6:41:18 PM
to stratosp...@googlegroups.com
Hej,

I attached the file.

The structure is very simple:
  ExecutionEnvironment.getExecutionEnvironment();
  load the DataSets
  pre-process if needed
  for each analysis:
    if analysis x should be done, do it and write its output to a file
  env.execute();

env.execute() is only called once, at the very end.


cheers Martin



Main.java

Stephan Ewen

Jun 1, 2014, 6:48:20 PM
to stratosp...@googlegroups.com
This does indeed look like it should work. Do you have the exception stack trace?

Stephan Ewen

Jun 1, 2014, 6:56:37 PM
to stratosp...@googlegroups.com
Are several of the algorithms you add iterative?

Martin Neumann

Jun 1, 2014, 7:08:01 PM
to stratosp...@googlegroups.com
ComponentDistribution and PageRank are both iterative.
OK, my PageRank implementation is buggy (I will ask about that in a separate topic).

The weird thing is that when I activate PageRank, I don't get the disconnected data flows exception; instead the job fails in PageRank (after InDegreeDistribution and OutDegreeDistribution have finished).
If I activate everything except PageRank, I get the disconnected data flows exception (after ComponentDistribution has finished); see the attached trace.




stack

Stephan Ewen

Jun 1, 2014, 10:45:42 PM
to stratosp...@googlegroups.com
Hey!

I think I have a patch that solves that (my repository, master branch). Travis is still testing it.

In general, I would suggest running the algorithms one after another. The way you have written it schedules them concurrently, which either consumes a lot of resources at the same time or scales down the resources each algorithm gets per node.

Greetings,
Stephan

Martin Neumann

Jun 2, 2014, 2:22:53 AM
to stratosp...@googlegroups.com
Running them separately would mean that each of them has to redo the pre-processing. Currently this is just the StringKey inversion, but I was planning to add more steps before that, which would make that part more expensive.
Is there a plan to support cached DataSets, so that I can run the analyses in sequence while reusing the modified input sets?


Stephan Ewen

Jun 2, 2014, 4:00:39 AM
to stratosp...@googlegroups.com
Yes, caching data sets is actually being worked on right now ;-)
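
Until caching is available, one conceivable workaround (an assumption, not something confirmed in this thread) is to run the pre-processing once as its own job, write the result to a file, and let the later jobs read that file instead of recomputing it. The sketch below shows the pattern in plain Java rather than with the Stratosphere API. `MaterializeOnce`, its methods, and the edge-reversal step are hypothetical stand-ins for the real pre-processing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Pattern sketch: materialize an expensive intermediate result once, then reuse it.
final class MaterializeOnce {

    // Hypothetical stand-in for the pre-processing: turns each edge "a b" into "b a".
    static List<String> preprocess(List<String> edges) {
        return edges.stream()
                .map(line -> {
                    String[] parts = line.split(" ");
                    return parts[1] + " " + parts[0];
                })
                .collect(Collectors.toList());
    }

    // First job: run the pre-processing once and store the result in a temp file.
    static Path materialize(List<String> rawEdges) throws IOException {
        Path out = Files.createTempFile("preprocessed-edges", ".txt");
        Files.write(out, preprocess(rawEdges));
        return out;
    }

    // Later job 1: reads the stored result, no pre-processing involved.
    static long countEdges(Path preprocessed) throws IOException {
        return Files.readAllLines(preprocessed).size();
    }

    // Later job 2: also reads the stored result.
    static long countDistinctSources(Path preprocessed) throws IOException {
        return Files.readAllLines(preprocessed).stream()
                .map(line -> line.split(" ")[0])
                .distinct()
                .count();
    }

    public static void main(String[] args) throws IOException {
        Path stored = materialize(Arrays.asList("a b", "a c", "b c"));
        System.out.println(countEdges(stored));           // 3
        System.out.println(countDistinctSources(stored)); // 2
        Files.delete(stored);
    }
}
```

The trade-off is an extra write and read of the intermediate data, which only pays off if the pre-processing is more expensive than that I/O.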

Stephan Ewen

Jun 2, 2014, 8:57:40 AM
to stratosp...@googlegroups.com, m.neuma...@gmail.com
The master branch now has a fix. The Maven snapshots should be in sync shortly.

I am now porting the fix to 0.5.1-SNAPSHOT.