OK, until now I started my Pig scripts (even on DataProc) from a separate machine with Pig installed, so they were submitted directly to YARN.
Before DataProc, I registered the UDFs just by registering the local jars:
REGISTER /opt/pig/lib/datafu-1.2.0.jar
REGISTER /opt/pig/lib/piggybank.jar
Now that I'm moving all my scripts over to being submitted through the API, I have a problem with the UDFs (without UDFs everything works OK). I add the UDF jars as jarFileUris. Example:
{'job': {'pigJob': {'queryFileUri': 'gs://bucket/staging/ranking-2016-01-22__35065754-c150.pig',
                    'jarFileUris': ['gs://bucket/udf/jar/piggybank.jar',
                                    'gs://bucket/udf/jar/datafu-1.2.0.jar'],
                    'continueOnFailure': False,
                    'scriptVariables': {'YYYY_lastmonth': '15',
                                        'month': '01',
                                        'MM_lastmonth': '15',
                                        'year': '2016',
                                        'day': '22',
                                        'out': 'gs://bucket/datasets/output/sql/ranking/2016/01/22/'}},
         'placement': {'clusterName': 'h2-dataproc'},
         'reference': {'projectId': 'foobar',
                       'jobId': 'ranking-2016-01-22__35065754-c150'}}}
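For completeness, the submission itself looks roughly like this (a minimal sketch using the google-api-python-client; the 'global' region and the variable names are assumptions on my side):

from googleapiclient.discovery import build

# Sketch: push the payload above through the Dataproc v1 jobs.submit method.
# 'job_payload' stands for the {'job': {...}} dict shown above.
dataproc = build('dataproc', 'v1')
job_payload = {'job': {'pigJob': {}}}  # fill in with the payload shown above

result = dataproc.projects().regions().jobs().submit(
    projectId='foobar',
    region='global',
    body=job_payload,
).execute()
print(result['reference']['jobId'])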
But just listing the jars on the classpath doesn't mean they are registered in the Pig script. And I can't register them via the REGISTER statement, because Pig expects the jars on the local file system; since they are only stored on GCS, I don't know what to pass to REGISTER...
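To make the catch-22 concrete, this is what worked before versus what I would apparently need now (just a sketch; I don't know whether the GCS form is supported at all):

-- works when Pig runs on a machine that has the jar locally:
REGISTER /opt/pig/lib/datafu-1.2.0.jar
-- what I would need now, but I don't know if REGISTER accepts a gs:// path:
REGISTER gs://bucket/udf/jar/datafu-1.2.0.jar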
I can't just remove the REGISTER statements either, because then I get an error as soon as I try to use a UDF:
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
<file /tmp/ranking-2016-01-22__35065754-c150/ranking-2016-01-22__35065754-c150.pig, line 149, column 18> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.bags.Enumerate using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
I could probably hack something together and push the UDF jars to the master, where they would presumably be picked up and shipped to the YARN nodes... but I don't think that is the purpose of an API.
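(Concretely, the hack would be something along these lines, assuming the master is named 'h2-dataproc-m', gsutil is available on it, and /opt/pig/lib is the right drop location, none of which I've verified:)

import subprocess

# Sketch of the hack I'd rather avoid: copy the UDF jars from GCS onto the
# master before submitting, so a plain local-path REGISTER could find them.
subprocess.check_call([
    'gcloud', 'compute', 'ssh', 'h2-dataproc-m', '--command',
    'sudo gsutil cp gs://bucket/udf/jar/*.jar /opt/pig/lib/',
])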
Anyone have an idea?!