Bug: Pig parameter substitution doesn't work with embedded query


Alex Van Boxel

Oct 31, 2016, 7:22:58 AM
to Google Cloud Dataproc Discussions
I have an annoying bug with some legacy Pig scripts. It seems that parameter substitution doesn't work when you submit a job via an embedded query. Example:


orders = load '${orders}' using AvroStorage();

Parameters
orders: gs://bucket/datasets/raw/carts/v1/2016/08/
out: gs://bucket/datasets/output/bigquery/cart-order/2016/08
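
For context, the submission embeds the query roughly like this (the cluster name and the store statement are illustrative, not the exact job):

gcloud dataproc jobs submit pig --cluster my-cluster \
  --params orders=gs://bucket/datasets/raw/carts/v1/2016/08/,out=gs://bucket/datasets/output/bigquery/cart-order/2016/08 \
  --execute "orders = load '\${orders}' using AvroStorage(); store orders into '\${out}';"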

This gives:

Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245: Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage
	at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:179)
	at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
	... 24 more
Caused by: java.io.IOException: No path matches pattern [Lorg.apache.hadoop.fs.Path;@663bb8ef
	at org.apache.pig.builtin.AvroStorage.getAvroSchema(AvroStorage.java:341)
	at org.apache.pig.builtin.AvroStorage.getAvroSchema(AvroStorage.java:313)
	at org.apache.pig.builtin.AvroStorage.getSchema(AvroStorage.java:287)
	at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
	... 25 more

I checked by hard-coding the path in the script:

orders = load 'gs://bucket/datasets/raw/carts/v1/2016/08/' using AvroStorage();

This works:

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1477827119007_0002	2	0	72	66	69	69	0	0	0	0	orders,out	MAP_ONLY	hdfs://h2-dataproc-m/user/root/asas${out},

Input(s):
Successfully read 1601145 records (766 bytes) from: "gs://vex-eu-data/datasets/raw/carts/v1/2016/08"

I came upon this because I knew it worked when using queryFileUri instead of embedding the query. I changed strategy while porting my Luigi operator to Airflow, where I now embed the query. I added code to handle gs:// references instead, and it works again.

Still, it's annoying. Is this a bug, or will parameter substitution with embedded queries never be supported?



Dennis Huo

Oct 31, 2016, 2:53:16 PM
to Google Cloud Dataproc Discussions
By design, Dataproc's embedded queries are intended to work the same as "pig -e"/"pig -execute", while queryFileUri works the same as "pig -f", since people may depend on various subtle differences in how Pig handles the two.

Unfortunately, this means we also inherit behavioral differences that may be seen as bugs, as in this case. I'd say it should be considered a bug in Pig, and that parameter substitution *ought* to work with inline queries the same way it works with query files.

Here's a lightweight repro that shows the issue both through "gcloud dataproc" and when running pig directly:

# Fails
gcloud dataproc jobs submit pig --cluster dhuo-new --params filename=file:///etc/pig/conf/log4j.properties.template --execute "a = load '\${filename}'; dump a;"

# Works
echo "a = load '\${filename}'; dump a;" > test.pig
gcloud dataproc jobs submit pig --cluster dhuo-new --params filename=file:///etc/pig/conf/log4j.properties.template --file test.pig

Looking at the code, they only call runParamPreprocessor inside the "case FILE:" block, and there's no equivalent in the "case STRING:" block. Possibly that's because the helper function assumes a file: from what I can tell, it does parameter substitution by writing out a new, substituted file.
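
To see that file-based preprocessing in isolation, pig's -dryrun flag runs only the preprocessor and writes the substituted script next to the original (a quick sketch reusing the repro script from above):

# Run only the parameter preprocessor; no job is executed.
echo "a = load '\${filename}'; dump a;" > test.pig
pig -dryrun -param filename=file:///etc/pig/conf/log4j.properties.template test.pig
# The substituted script lands next to the original:
cat test.pig.substituted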

In this case you could file a JIRA in Pig upstream if one doesn't already exist tracking this, or someone from Google may be able to once we've had time to diagnose it with more certainty. If we can then get a patch submitted upstream, it can be backported in a future Dataproc version.

Alex Van Boxel

Oct 31, 2016, 7:03:21 PM
to Google Cloud Dataproc Discussions
Thanks for the info. We're phasing out Pig anyway, but it's important for the Apache Airflow operator for Dataproc Pig. I'll add it to the Airflow documentation for the operator.

The funny thing is that when I wrote the Luigi operator for Dataproc, I copied the file over to storage, so it worked.
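
For reference, that approach amounts to staging the script in GCS and submitting it as a query file (bucket and file names here are hypothetical):

# Stage the script so Dataproc runs it via the "pig -f" path,
# where parameter substitution works.
gsutil cp cart-order.pig gs://my-bucket/scripts/cart-order.pig
gcloud dataproc jobs submit pig --cluster my-cluster \
  --file gs://my-bucket/scripts/cart-order.pig \
  --params orders=gs://bucket/datasets/raw/carts/v1/2016/08/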


