join with hierarchical structure

32 views
Skip to first unread message

Mathieu D

unread,
Mar 20, 2015, 10:50:42 AM3/20/15
to cascadi...@googlegroups.com
Hi cascading folks,


I need to implement a sophisticated join with a hierachical structure.

On the left : a classical pipe with an id in the fields.

On the right : a pipe with data coming from a table having a parent/child hierarchy like this :

id  |  parent_id | val 
------------------------
10  |  5         |  AAA
 9  |  5         |  BBB
 8  |  3         |  CCC
...
 5  |  3         |  DDD
 4  |   null     |  EEE
 3  |   null     |  FFF


My "join" needs to "unwind" the hierarchy until parent_id is null (ie lookup : id -> parent_id -> id -> parent_id ...  until I reach a null)
Meaning that my output after a join between my left table and this hierarchy should be this :

lhs.id | join output
----------------------
 10    |   FFF 
  9    |   FFF
  8    |   FFF
  5    |   FFF
  4    |   EEE
  3    |   FFF


I hava no idea how I can build this using existing Operations in cascading API.
If I were to implement this manually in M/R, I would implement a sort a hashmap, loading the full rhs table in memory on map-side (fortunately it's quite small), doing the hierarchy "unwinding" in memory, and do the join like this.

How can I do that w/ Cascading ?

Thanks for your help
Mathieu




Ken Krugler

unread,
Mar 20, 2015, 11:16:28 AM3/20/15
to cascadi...@googlegroups.com
Hi Mathieu,

Assuming you ultimately get a Map<id, val> that fits in memory, then this function is trivial.

See my "RE: Do I need a custom function" email from a few days ago.

-- Ken


From: Mathieu D

Sent: March 20, 2015 7:50:42am PDT

To: cascadi...@googlegroups.com

Subject: join with hierarchical structure



--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




Mathieu D

unread,
Mar 20, 2015, 1:07:15 PM3/20/15
to cascadi...@googlegroups.com
Indeed, once I have my map in memory, this is trivial.

My question becomes : how do I fill this in-memory Map from the rhs pipe of my flow ? 

Mathieu

Ken Krugler

unread,
Mar 20, 2015, 1:20:42 PM3/20/15
to cascadi...@googlegroups.com


From: Mathieu D

Sent: March 20, 2015 10:07:12am PDT

To: cascadi...@googlegroups.com

Subject: Re: join with hierarchical structure


Indeed, once I have my map in memory, this is trivial.

My question becomes : how do I fill this in-memory Map from the rhs pipe of my flow ? 

That's become a popular question on the list :)

The issue is that this is essentially a "side-channel", so Cascading doesn't know that it shouldn't run the portion of the flow that uses this data until the upstream portion of the flow has completed.

The simplest approach is to break up the overall flow into two pieces - you run the first Flow, then pull the resulting data into memory (just open the Tap on that file and read it) and use that map to construct the second Flow.

-- Ken

Mathieu D

unread,
Mar 20, 2015, 1:46:24 PM3/20/15
to cascadi...@googlegroups.com
yuk ! 
so ugly :-(


I guess trying to implement what is done in HashJoin would be very hard or a too deep dive in cascading fwk just to do this. ..

Mathieu D

unread,
Mar 20, 2015, 4:11:55 PM3/20/15
to cascadi...@googlegroups.com
And what if I use, on my rhs pipe a GroupBy on a dummy constant key, and use a CustomAggregator to mount all the tuples in memory, to flatten my hierarchy ?


Le vendredi 20 mars 2015 18:20:42 UTC+1, kkrugler a écrit :

Mathieu D

unread,
Mar 30, 2015, 4:34:44 PM3/30/15
to cascadi...@googlegroups.com
Just to share, I finally implemented it using a custom Buffer operation, with a null key for grouping.
In the operate method, i load all the incoming data in a map in-memory, and then I resolve the hierarchy and produce the output.

The data I load should easily fit in memory, but if it's not the case, I don't exactly what will happen (memory management around the buffer is unclear for me).

Mathieu

Oscar Boykin

unread,
Mar 30, 2015, 4:41:03 PM3/30/15
to cascadi...@googlegroups.com
The scalding typed-API has a means to do this. You can convert a TypedPipe[T] to an Iterable[T] inside an Execution container, which is basically like a Future container if you have seen those before (when you run the Execution, the flows, possibly iterative, are submitted).

It does seem to come up a lot and we are using this at Twitter a lot since it was added about 6 months ago.

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/18a9faf3-f437-49dd-a509-64eea5480766%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Oscar Boykin :: @posco :: http://twitter.com/posco
Reply all
Reply to author
Forward
0 new messages