Pig and UDFs

Alex Van Boxel

Jan 22, 2016, 5:05:22 PM
to Google Cloud Dataproc Discussions
OK, until now I ran my Pig scripts (even on DataProc) from a separate machine with Pig installed (so it submitted them directly to YARN).

Before DataProc, I registered the UDFs just by registering local jars:

REGISTER /opt/pig/lib/datafu-1.2.0.jar
REGISTER /opt/pig/lib/piggybank.jar

Now that I'm moving all my scripts over to job submission, I have a problem with the UDFs (without UDFs it works fine). I add the UDFs as jarFileUris via the API. Example:


{'job': {'pigJob': {'queryFileUri': 'gs://bucket/staging/ranking-2016-01-22__35065754-c150.pig',
                    'jarFileUris': ['gs://bucket/udf/jar/piggybank.jar',
                                    'gs://bucket/udf/jar/datafu-1.2.0.jar'],
                    'continueOnFailure': False,
                    'scriptVariables': {'YYYY_lastmonth': '15',
                                        'month': '01',
                                        'MM_lastmonth': '15',
                                        'year': '2016',
                                        'day': '22',
                                        'out': 'gs://bucket/datasets/output/sql/ranking/2016/01/22/'}},
         'placement': {'clusterName': 'h2-dataproc'},
         'reference': {'projectId': 'foobar',
                       'jobId': 'ranking-2016-01-22__35065754-c150'}}}
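(For completeness, this should be roughly equivalent to the following gcloud invocation; I haven't verified the exact flag names on my side, so treat it as a sketch with the same placeholder bucket and cluster names as above:)

```shell
# Sketch: submit the same Pig job via the gcloud CLI instead of the raw REST API.
# Bucket, cluster, and parameter values are placeholders from the example above.
gcloud dataproc jobs submit pig \
  --cluster=h2-dataproc \
  --file=gs://bucket/staging/ranking-2016-01-22__35065754-c150.pig \
  --jars=gs://bucket/udf/jar/piggybank.jar,gs://bucket/udf/jar/datafu-1.2.0.jar \
  --params=year=2016,month=01,day=22,out=gs://bucket/datasets/output/sql/ranking/2016/01/22/
```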

But listing the jars on the classpath doesn't mean they are registered in the Pig script. I can't register them via the REGISTER function because Pig expects them on the local file system, and since they are only stored on GCS I don't know what to pass to the REGISTER function in Pig...

I can't remove them, because then I get an error as soon as I want to access a UDF:

DEFINE Enumerate datafu.pig.bags.Enumerate('1');

<file /tmp/ranking-2016-01-22__35065754-c150/ranking-2016-01-22__35065754-c150.pig, line 149, column 18> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.bags.Enumerate using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

I could probably hack something together and push the UDFs to the master, where they'd presumably be picked up and shipped to the YARN nodes... but I don't think that is the purpose of an API.

Anyone an idea?!


Dennis Huo

Jan 22, 2016, 5:21:31 PM
to Google Cloud Dataproc Discussions
I'll have to double-check, but the jar files should be made available as unqualified filenames inside the working directory of the Pig driver, so you should be able to do

REGISTER datafu-1.2.0.jar
REGISTER piggybank.jar
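So with the jars from your jarFileUris above, the top of the script would look something like this (the DEFINE is copied from your snippet; a sketch, untested on my end):

```pig
-- Jars passed via jarFileUris land in the driver's working directory,
-- so they can be registered by their bare filenames:
REGISTER datafu-1.2.0.jar;
REGISTER piggybank.jar;

-- After registering, the UDF resolves as usual:
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
```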

Alex Van Boxel

Jan 23, 2016, 6:28:40 AM
to Google Cloud Dataproc Discussions
Yes, that works. Thanks! That means I can start to port all my Pig tasks. yee