Reading from Hive Views as taps in a Cascading flow

John Lavoie

Oct 19, 2016, 4:32:02 PM
to cascading-user

We’re trying to extend our existing Cascading jobs to read data from alternate sources. We were hoping to simply point our input taps at Hive views instead of the flat files they currently ingest, but we’re running into some challenges: this doesn’t seem to be possible with the current framework APIs. The Hive capabilities appear to be designed for executing queries and storing the results temporarily as files, which later steps then read in as taps.

 

We were hoping to streamline the steps, eliminating extra reads/writes and minimizing the temporary storage the job requires. The cascading-hive library seems to just look up the table metadata and read the underlying files into the tap, and therefore cannot support Hive views. We really want to avoid the additional I/O that materializing these views, or storing their results in a temp table, would cost. Most of our executions run on temporary datasets that are redefined daily or more frequently, and we’re trying to minimize the overhead of each execution, so a complete copy and persist of the entire dataset is undesirable.

 

Are there any reasonable ways to achieve what we are trying to do?


Thanks,

John
