Using HCatalog in Pig


Ram Kedem

unread,
Jun 6, 2016, 8:07:40 AM
to Google Cloud Dataproc Discussions
Hi,

I'm trying to implement Pig's dynamic partitioning using HCatalog and am receiving the following error (test is a table created in Hive):

a = load 'test' using org.apache.hcatalog.pig.HCatLoader();
2016-06-06 11:55:27,317 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatLoader using imports

I wanted to know: is this feature supported in Google Cloud Dataproc, and if so, which jars should I download?

Many thanks

ram....@adk2.com

unread,
Jun 7, 2016, 9:02:21 AM
to Google Cloud Dataproc Discussions
To make my previous question clearer, these are the errors I receive when I try to call HCatalog:

pig -useHCatalog
ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory
ls: cannot access /usr/lib/hive-hcatalog/share/hcatalog/*hcatalog-core-*.jar: No such file or directory
ls: cannot access /usr/lib/hive-hcatalog/share/hcatalog/hcatalog-*.jar: No such file or directory
ls: cannot access /usr/lib/hive-hcatalog/lib/*hbase-storage-handler-*.jar: No such file or directory
ls: cannot access /usr/lib/hive-hcatalog/share/hcatalog/*hcatalog-pig-adapter-*.jar: No such file or directory

Angus Davis

unread,
Jun 7, 2016, 1:27:39 PM
to Google Cloud Dataproc Discussions
Hi Ram,

HCatalog is not installed by default, but it can be added via an initialization action. In the initialization action, add:

apt-get install hive-hcatalog

After installing HCatalog, use the loader's fully qualified class name in the LOAD statement:

 A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
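To connect this back to the dynamic-partitioning question in the original post, a minimal sketch might look like the following (the table names and the partition column are assumptions for illustration, not from this thread). HCatStorer writes each record to the matching partition of the target Hive table when no explicit partition spec is given:

```pig
-- Load a Hive table through HCatalog (table name is hypothetical).
A = LOAD 'web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Store into a Hive table partitioned by, say, dt. With no partition
-- spec in HCatStorer(), the partition is taken from each record's dt
-- field (dynamic partitioning).
STORE A INTO 'web_logs_by_day' USING org.apache.hive.hcatalog.pig.HCatStorer();
```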

Angus

ram....@adk2.com

unread,
Jun 9, 2016, 1:11:28 AM
to Google Cloud Dataproc Discussions
Thanks ! 

vivek.k...@exadatum.com

unread,
Mar 22, 2017, 10:10:03 AM
to Google Cloud Dataproc Discussions
Hi Angus,

I am also getting the same error, and I tried the solution you gave, but the error persists. I also tried to run apt-get install hive-hcatalog from the Google Cloud Shell and got the error "Unable to locate package hive-hcatalog".

I am running the Pig script using gcloud, and when I pass the "-useHCatalog" argument I get the following error: "ERROR: (gcloud.beta.dataproc.jobs.submit.pig) unrecognized arguments: -useHCatalog".
So I ran it without "-useHCatalog"; I hope that is not the cause of the error.

Can you please help me with this?

Vivek

Angus Davis

unread,
Mar 22, 2017, 3:08:44 PM
to Google Cloud Dataproc Discussions
Hi Vivek,

Dataproc clusters have extra apt repositories configured so that they can install Hadoop ecosystem components such as hive-hcatalog; you will not be able to install this package from the Cloud Shell.

The simplest solution I have found for submitting Pig jobs from gcloud is the script below, which should be used as a Dataproc initialization action [1]. It first installs hive-hcatalog and then modifies /etc/pig/conf/pig-env.sh so that HCatalog is included by default for all Pig invocations.



--- begin init action ---
#!/bin/bash

# Install the HCatalog package from the Dataproc apt repositories.
apt-get -q -y install hive-hcatalog

# Enable HCatalog by default for every Pig invocation on the cluster.
cat >>/etc/pig/conf/pig-env.sh <<EOF
includeHCatalog=true
EOF
--- end init action ---
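One way to apply a script like this (the bucket, cluster, and file names below are placeholders, not from this thread) is to stage it in Cloud Storage and reference it when creating the cluster:

```shell
# Stage the init action in a GCS bucket (names are placeholders).
gsutil cp hcatalog-init.sh gs://my-bucket/hcatalog-init.sh

# Create a cluster that runs the init action on each node during setup.
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/hcatalog-init.sh
```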

vivek.k...@exadatum.com

unread,
Mar 23, 2017, 1:56:57 AM
to Google Cloud Dataproc Discussions
Hi Angus,

The steps you described worked fine for me. Thank you very much for the help.

Thanks,
Vivek