HashJoin self-join of a small file produces partial results

Rakesh Iyer

Dec 6, 2021, 10:23:14 PM

to cascading-user

I am a newbie to the Cascading API.

I wrote the following sample, which does a self-join using HashJoin.

String infile = args[0];
String outfile = args[1];

Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, MyFlow8.class);
AppProps.setApplicationName(properties,"myflow8");

FlowConnector flowConnector = new Hadoop2MR1FlowConnector();
Fields sourceFields = new Fields("key", "value", "count");
Tap sourceTap = new Hfs(new TextDelimited(sourceFields), infile);
Tap sinkTapEvery = new Hfs(new TextDelimited(), outfile + "_every");

FlowDef flowDef = new FlowDef();
Pipe pipe = new Pipe("everypipe");
Pipe hashJoin = new HashJoin(pipe, new Fields("value"), 1, new Fields("key1", "value1", "count1", "key2", "value2", "count2"));

flowDef.addSource(pipe, sourceTap);
flowDef.addTailSink(hashJoin, sinkTapEvery);

flowDef.setAssertionLevel(AssertionLevel.STRICT);
flowDef.setDebugLevel(DebugLevel.VERBOSE);
Flow flow = flowConnector.connect(flowDef);

flow.complete();

With an input of

try this 1

try this 2

try this 5

try this 6

try this 7

try this 8

I get output from the HashJoin that is split into 2 files.

File 1

try this 1 try this 1

try this 1 try this 2

try this 1 try this 5

try this 1 try this 6

try this 2 try this 1

try this 2 try this 2

try this 2 try this 5

try this 2 try this 6

try this 5 try this 1

try this 5 try this 2

try this 5 try this 5

try this 5 try this 6

try this 6 try this 1

try this 6 try this 2

try this 6 try this 5

try this 6 try this 6

and File 2

try this 7 try this 7

try this 7 try this 8

try this 8 try this 7

try this 8 try this 8

I expected every input row to join with every other row. Instead, the input seems to be partitioned (4 tuples in the first partition, 2 tuples in the second), and the join is limited to the rows within each partition.

If the input is fed through 2 separate pipes and those pipes are joined, the join comes out as expected, i.e. all 6 tuples are joined to generate 36 joined tuples.
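To make the arithmetic concrete, here is a minimal plain-Java sketch (no Cascading involved) that contrasts a full self-join on "value" with a self-join restricted to partitions. The 4 + 2 split is an assumption inferred from the two output files, not something confirmed from the job itself, and the class and method names (SelfJoinSketch, selfJoin, partitionedSelfJoin, sampleRows) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SelfJoinSketch {
    // One input row: key, value, count (mirrors the three source fields).
    public static final class Row {
        final String key;
        final String value;
        final int count;

        Row(String key, String value, int count) {
            this.key = key;
            this.value = value;
            this.count = count;
        }
    }

    // The six sample rows from the input file.
    public static List<Row> sampleRows() {
        int[] counts = {1, 2, 5, 6, 7, 8};
        List<Row> rows = new ArrayList<>();
        for (int c : counts) {
            rows.add(new Row("try", "this", c));
        }
        return rows;
    }

    // Joins every row with every row in the same list that has an equal "value".
    public static List<String> selfJoin(List<Row> rows) {
        List<String> out = new ArrayList<>();
        for (Row lhs : rows) {
            for (Row rhs : rows) {
                if (lhs.value.equals(rhs.value)) {
                    out.add(lhs.key + " " + lhs.value + " " + lhs.count + " "
                            + rhs.key + " " + rhs.value + " " + rhs.count);
                }
            }
        }
        return out;
    }

    // Runs the self-join separately inside each partition and concatenates the results.
    public static List<String> partitionedSelfJoin(List<List<Row>> partitions) {
        List<String> out = new ArrayList<>();
        for (List<Row> partition : partitions) {
            out.addAll(selfJoin(partition));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> all = sampleRows();

        // Assumed split: first 4 rows in one partition, last 2 in the other.
        List<List<Row>> partitions = new ArrayList<>();
        partitions.add(all.subList(0, 4));
        partitions.add(all.subList(4, 6));

        System.out.println("full self-join rows: " + selfJoin(all).size());
        System.out.println("partitioned self-join rows: " + partitionedSelfJoin(partitions).size());
    }
}
```

Because all six rows share the same "value", the full join yields 6 x 6 = 36 rows, while the partitioned variant yields 4 x 4 + 2 x 2 = 20 rows, which matches the 16 rows in File 1 plus the 4 rows in File 2.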

Is this the expected behavior of HashJoin?

Chris K Wensel

Dec 7, 2021, 2:14:44 PM

to cascadi...@googlegroups.com

Can you share what version of Hadoop you are running in the cluster?

Also, is this test run on the cluster or locally?

chris

—
Chris K Wensel
ch...@wensel.net
