Error in running PIG SCRIPT in DATAPROC CLUSTER

chethan124 kumar

Dec 14, 2022, 5:30:40 AM12/14/22
to Google Cloud Dataproc Discussions
Hi Team,

I'm able to run the word count job from the documentation,

1) but when I try to submit a job with a properties file, the values from the properties file are not read.

Here is the command:

gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --verbosity=debug \
  --properties=gs://intellibid-temp/cvr_gcs.properties \
  --file=gs://intellibid-temp/intellibid-intermediat-cvr.pig

Below is the error:

INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
ERROR org.apache.pig.impl.PigContext - Undefined parameter : udf_path
2022-12-13 11:58:51,504 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : udf_path
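For what it's worth, my understanding from the gcloud reference is that `--properties` expects comma-separated key=value pairs rather than a GCS path, and that Pig parameter substitution (the `$udf_path`-style placeholders) is driven by `--params`, not by properties. A hedged sketch of what I think the intended invocation would look like (the `udf_path` value below is a made-up placeholder):

```shell
# Sketch only: --params feeds Pig parameter substitution ($udf_path etc.);
# the udf_path value here is a placeholder, not a real path from my setup.
gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --verbosity=debug \
  --file=gs://intellibid-temp/intellibid-intermediat-cvr.pig \
  --params=udf_path=gs://intellibid-temp/udfs
```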



2) When I try to submit the job like this:

gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --verbosity=debug \
  --properties-file=gs://bucket-temp/cvr_gcs_one.properties \
  --file=gs://bucket-temp/cvr_intelli_test.pig \
  --jars=gs://bucket-intellibid-data-science/emr/jars/Intellibid_udfs.jar \
  --params train_cvr=gs://bucket-temp/{2022-12-09},output_dir=gs://bucket-analytics-intellibid/Azkaban-Intellibid,currdate=2022-12-13,csv_dir=gs://bucket-intellibid-db-dump/prod,inp_dir=gs://bucket-temp/my_file.txt
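For reference, the file passed to `--properties-file` is expected to contain plain `key=value` lines (Java properties format), one property per line. A hypothetical example of what such a file might hold (all names and values here are illustrative, not my actual `cvr_gcs_one.properties`):

```properties
# Hypothetical contents; one key=value property per line
pig.exec.mapPartAgg=true
mapreduce.task.timeout=10800000
```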

Below is the Pig script:

register gs://bucket-data-science/emr/jars/Intellibid_udfs.jar;
--register $piggy


SET default_parallel 300;
SET pig.exec.mapPartAgg true; -- Push partial aggregation into the map phase to reduce combiner load

SET pig.tmpfilecompression true;     -- Compress temporary files between MapReduce jobs (mainly useful with joins)
SET pig.tmpfilecompression.codec gz; -- Codec for temporary-file compression
SET mapreduce.map.output.compress true; -- Compress intermediate output between map and reduce
SET mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapred.map.tasks.speculative.execution false;
SET mapreduce.task.timeout 10800000;
SET mapreduce.output.fileoutputformat.compress true;
SET mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.map.maxattempts 16;
SET mapreduce.reduce.maxattempts 16;
SET mapreduce.job.queuename HIGH_PRIORITY;


define VIZSUM com.company.udfs.common.COMPANYSUM();


--UDF to filter only relevant campaigns mentioned in intellibidadvidtoadvname.csv
-- define adv_exists   com.company.udfs.common.TRANS_EXISTS('$csv_dir', 'intellibidadvidtoadvname.csv');
--UDF to Flatten Input data relevant to generate buckets(bucket.ini_pig)
-- define flattenBucketizeInput_cvr com.company.udfs.common.VIZFLATTEN('$csv_dir', 'variableList.ini', 'cvr');
--UDF to bucketize cvr data takes bucket.ini_pig as input
--define bucketize_cvr_data com.company.udfs.common.PROCESS_TRAIN('$csv_dir', 'variableList.ini', 'bucket.ini_pig',  'cvr');
--UDF to get keys of advertiser & publisher combo
define get_cvr_key com.company.udfs.common.ALL_CTR_MODEL('$csv_dir', 'variableList.ini', 'cvr', 'newcampaignToKeyMap', 'listofpublishers.txt','intellibid_pam_parent_mapping.csv','ib-google-geo-pub-map','apppublishers.csv', 'F,F,T,T');
--UDF to generate multiple files only for bg keys so that we can avoid a skew reducer.
define multiple_file_generator com.company.udfs.common.CVR_KEY_GENERATION('$csv_dir','newcampaignToKeyMap','cvr_big_campaign_list','30');


-- Load the data from GCS
data = LOAD '$inp_dir' USING PigStorage();

-- Split the data into individual words
words = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;

-- Group the data by word and count the number of occurrences of each word
word_counts = GROUP words BY word;
word_counts = FOREACH word_counts GENERATE group, COUNT(words);

-- Store the word counts in GCS
STORE word_counts INTO '$output_dir/$currdate/' USING PigStorage();


Below is the error:
org.apache.hadoop.mapreduce.JobSubmitter - Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1670597881696_0007
2022-12-14 09:52:12,050 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob - PigLatin:cvr_intelli_test.pig got an error while submitting
java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:346)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:251)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.pig.backend.hadoop.PigJobControl.submit(PigJobControl.java:128)
    at org.apache.pig.backend.hadoop.PigJobControl.run(PigJobControl.java:205)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:298)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:327)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:303)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:331)
    ... 16 more
2022-12-14 09:52:12,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1670597881696_0007
2022-12-14 09:52:12,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,word_counts,words
2022-12-14 09:52:12,052 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[37,7],words[40,8],word_counts[44,14],word_counts[43,14] C: word_counts[44,14],word_counts[43,14] R: word_counts[44,14]
2022-12-14 09:52:12,068 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2022-12-14 09:52:17,084 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2022-12-14 09:52:17,088 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at cluster-workaround-m/10.125.104.112:8032
2022-12-14 09:52:17,089 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at cluster-workaround-m/10.125.104.112:10200
2022-12-14 09:52:17,147 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at cluster-workaround-m/10.125.104.112:8032
2022-12-14 09:52:17,147 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at cluster-workaround-m/10.125.104.112:10200
2022-12-14 09:52:17,163 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2022-12-14 09:52:17,164 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
3.2.3    0.18.0-SNAPSHOT    root    2022-12-14 09:52:10    2022-12-14 09:52:17    GROUP_BY

Failed!

Failed Jobs:
JobId    Alias    Feature    Message    Outputs
job_1670597881696_0007    data,word_counts,words    GROUP_BY,COMBINER,MAP_PARTIALAGG    Message: java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:346)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:251)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.pig.backend.hadoop.PigJobControl.submit(PigJobControl.java:128)
    at org.apache.pig.backend.hadoop.PigJobControl.run(PigJobControl.java:205)
    at java.lang.Thread.run(Thread.java:750)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:298)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1670597881696_0007 to YARN : Application application_1670597881696_0007 submitted by user root to unknown queue: HIGH_PRIORITY
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:327)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:303)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:331)
    ... 16 more
    gs://bucket-intellibid/Azkaban-Intellibid/2022-12-13,

Input(s):
Failed to read data from "gs://intellibid-temp/my_file.txt"

Output(s):
Failed to produce result in "gs://bucket-intellibid/Azkaban-Intellibid/2022-12-13"
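Regarding the "unknown queue: HIGH_PRIORITY" failure: that message comes from YARN's capacity scheduler, which on a default Dataproc cluster only defines the `default` queue. As far as I can tell from the Dataproc docs, additional queues must be declared in `capacity-scheduler.xml` at cluster creation, via `capacity-scheduler:`-prefixed cluster properties. A hedged sketch (the queue name, capacities, and `^#^` delimiter escaping are illustrative, not tested end to end):

```shell
# Sketch only: declare a HIGH_PRIORITY queue via file-prefixed cluster
# properties. The ^#^ prefix switches gcloud's list delimiter to '#'
# because the queues value itself contains a comma.
gcloud dataproc clusters create cluster-workaround \
  --region=us-east4 \
  --properties='^#^capacity-scheduler:yarn.scheduler.capacity.root.queues=default,HIGH_PRIORITY#capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=50#capacity-scheduler:yarn.scheduler.capacity.root.HIGH_PRIORITY.capacity=50'
```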

Can anyone help me here?

P.S. To the Google team: if you can't provide enough documentation, please don't release the products.

Thanks


