Error in running PIG SCRIPT in DATAPROC CLUSTER

chethan124 kumar

Dec 14, 2022, 5:30:40 AM12/14/22
to Google Cloud Dataproc Discussions
Hi Team,

I'm able to run the word count job from the documentation,

1) but when I try to submit a job with a properties file, the values from the properties file are not read.

Here is the command:

gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --verbosity=debug \
  --properties=gs://intellibid-temp/cvr_gcs.properties \
  --file=gs://intellibid-temp/intellibid-intermediat-cvr.pig

Below is the error:

INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
ERROR org.apache.pig.impl.PigContext - Undefined parameter : udf_path
2022-12-13 11:58:51,504 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : udf_path
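For what it's worth, my understanding from the gcloud reference is that `--properties` expects comma-separated key=value pairs rather than a GCS path, and that Pig parameter substitution (the `$udf_path`-style placeholders) is driven by `--params`, not by properties. A hedged sketch of what I think the intended invocation would look like (the `udf_path` value below is a made-up placeholder):

```shell
# Sketch only: --params feeds Pig parameter substitution ($udf_path etc.);
# the udf_path value here is a placeholder, not a real path from my setup.
gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --verbosity=debug \
  --file=gs://intellibid-temp/intellibid-intermediat-cvr.pig \
  --params=udf_path=gs://intellibid-temp/udfs
```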



2) When I try to submit the job like this:

gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --verbosity=debug \
  --properties-file=gs://bucket-temp/cvr_gcs_one.properties \
  --file=gs://bucket-temp/cvr_intelli_test.pig \
  --jars=gs://bucket-intellibid-data-science/emr/jars/Intellibid_udfs.jar \
  --params train_cvr=gs://bucket-temp/{2022-12-09},output_dir=gs://bucket-analytics-intellibid/Azkaban-Intellibid,currdate=2022-12-13,csv_dir=gs://bucket-intellibid-db-dump/prod,inp_dir=gs://bucket-temp/my_file.txt
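For reference, the file passed to `--properties-file` is expected to contain plain `key=value` lines (Java properties format), one property per line. A hypothetical example of what such a file might hold (all names and values here are illustrative, not my actual `cvr_gcs_one.properties`):

```properties
# Hypothetical contents; one key=value property per line
pig.exec.mapPartAgg=true
mapreduce.task.timeout=10800000
```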

Below is the Pig script:

register gs://bucket-data-science/emr/jars/Intellibid_udfs.jar;
--register $piggy


SET default_parallel 300;
SET pig.exec.mapPartAgg true; -- Push partial aggregation into the map phase to reduce combiner load

SET pig.tmpfilecompression true;     -- Compress temporary files between MapReduce jobs (mainly useful with joins)
SET pig.tmpfilecompression.codec gz; -- Codec for temporary-file compression
SET mapreduce.map.output.compress true; -- Compress intermediate output between map and reduce
SET mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapred.map.tasks.speculative.execution false;
SET mapreduce.task.timeout 10800000;
SET mapreduce.output.fileoutputformat.compress true;
SET mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.map.maxattempts 16;
SET mapreduce.reduce.maxattempts 16;
SET mapreduce.job.queuename HIGH_PRIORITY;


define VIZSUM com.company.udfs.common.COMPANYSUM();


--UDF to filter only relevant campaigns mentioned in intellibidadvidtoadvname.csv
-- define adv_exists   com.company.udfs.common.TRANS_EXISTS('$csv_dir', 'intellibidadvidtoadvname.csv');
--UDF to Flatten Input data relevant to generate buckets(bucket.ini_pig)
-- define flattenBucketizeInput_cvr com.company.udfs.common.VIZFLATTEN('$csv_dir', 'variableList.ini', 'cvr');
--UDF to bucketize cvr data takes bucket.ini_pig as input
--define bucketize_cvr_data com.company.udfs.common.PROCESS_TRAIN('$csv_dir', 'variableList.ini', 'bucket.ini_pig',  'cvr');
--UDF to get keys of advertiser & publisher combo
define get_cvr_key com.company.udfs.common.ALL_CTR_MODEL('$csv_dir', 'variableList.ini', 'cvr', 'newcampaignToKeyMap', 'listofpublishers.txt','intellibid_pam_parent_mapping.csv','ib-google-geo-pub-map','apppublishers.csv', 'F,F,T,T');
--UDF to generate multiple files only for bg keys so that we can avoid a skew reducer.
define multiple_file_generator com.company.udfs.common.CVR_KEY_GENERATION('$csv_dir','newcampaignToKeyMap','cvr_big_campaign_list','30');


-- Load the data from GCS
data = LOAD '$inp_dir' USING PigStorage();

-- Split the data into individual words
words = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;

-- Group the data by word and count the number of occurrences of each word
word_counts = GROUP words BY word;
word_counts = FOREACH word_counts GENERATE group, COUNT(words);

-- Store the word counts in GCS
STORE word_counts INTO '$output_dir/$currdate/' USING PigStorage();


Below is the error:
org.apache.hadoop.mapreduce.JobSubmitter - Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1670597881696_0007
2022-12-14 09:52:12,050 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob - PigLatin:cvr_intelli_test.pig got an error while submitting
java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:346)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:251)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.pig.backend.hadoop.PigJobControl.submit(PigJobControl.java:128)
    at org.apache.pig.backend.hadoop.PigJobControl.run(PigJobControl.java:205)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:298)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:327)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:303)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:331)
    ... 16 more
2022-12-14 09:52:12,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1670597881696_0007
2022-12-14 09:52:12,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,word_counts,words
2022-12-14 09:52:12,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[37,7],words[40,8],word_counts[44,14],word_counts[43,14] C: word_counts[44,14],word_counts[43,14] R: word_counts[44,14]
2022-12-14 09:52:12,068 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2022-12-14 09:52:17,084 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2022-12-14 09:52:17,088 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at cluster-workaround-m/10.125.104.112:8032
2022-12-14 09:52:17,089 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at cluster-workaround-m/10.125.104.112:10200
2022-12-14 09:52:17,147 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at cluster-workaround-m/10.125.104.112:8032
2022-12-14 09:52:17,147 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at cluster-workaround-m/10.125.104.112:10200
2022-12-14 09:52:17,163 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2022-12-14 09:52:17,164 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
3.2.3    0.18.0-SNAPSHOT    root    2022-12-14 09:52:10    2022-12-14 09:52:17    GROUP_BY

Failed!

Failed Jobs:
JobId    Alias    Feature    Message    Outputs
job_1670597881696_0007    data,word_counts,words    GROUP_BY,COMBINER,MAP_PARTIALAGG    Message: java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:346)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:251)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.pig.backend.hadoop.PigJobControl.submit(PigJobControl.java:128)
    at org.apache.pig.backend.hadoop.PigJobControl.run(PigJobControl.java:205)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:298)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:327)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:303)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:331)
    ... 16 more
    gs://bucket-intellibid/Azkaban-Intellibid/2022-12-13,

Input(s):
Failed to read data from "gs://intellibid-temp/my_file.txt"

Output(s):
Failed to produce result in "gs://bucket-intellibid/Azkaban-Intellibid/2022-12-13"
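Regarding the "unknown queue: HIGH_PRIORITY" failure: that message comes from YARN's capacity scheduler, which on a default Dataproc cluster only defines the `default` queue. As far as I can tell from the Dataproc docs, additional queues must be declared in `capacity-scheduler.xml` at cluster creation, via `capacity-scheduler:`-prefixed cluster properties. A hedged sketch (the queue name, capacities, and `^#^` delimiter escaping are illustrative, not tested end to end):

```shell
# Sketch only: declare a HIGH_PRIORITY queue via file-prefixed cluster
# properties. The ^#^ prefix switches gcloud's list delimiter to '#'
# because the queues value itself contains a comma.
gcloud dataproc clusters create cluster-workaround \
  --region=us-east4 \
  --properties='^#^capacity-scheduler:yarn.scheduler.capacity.root.queues=default,HIGH_PRIORITY#capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=50#capacity-scheduler:yarn.scheduler.capacity.root.HIGH_PRIORITY.capacity=50'
```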

Can anyone help me here?

P.S. To the Google team: if you can't provide enough documentation, please don't release the products.

Thanks


