cascading hive repository


neil....@wherescape.com

unread,
May 5, 2015, 7:59:49 PM5/5/15
to cascadi...@googlegroups.com
I am attempting to pull data from an Oracle RDBMS and load it into a Hive table.  I have no problems writing the data into HDFS, so I know the source tap and the Hfs sink both work.  When I switch the sink tap over to Hive, I encounter a number of errors that I seem unable to resolve.

a) My hive-site.xml is configured to use a MySQL database for the Hive repository.  However, when the Cascading job runs, it appears to be using/creating a local Derby database in the working directory of the Java process rather than connecting to my MySQL database.

b) if the Hive directory for the table already exists, I get an error:
       "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ubuntu-apache251:9000/user/hive/warehouse/perm_ora_customer already exists"

c) if the table does not exist, then I get:
    "Caused by: MetaException(message:file:/user/hive/warehouse/perm_ora_customer is not a directory or unable to create one)"
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1239)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

NOTES: 
- I am programmatically invoking the job from within a Tomcat webapp.
- I can successfully run a Sqoop (v1) command from the same Tomcat webapp and have it load the data without any issues -- so I know Hive and Hadoop are configured correctly.


Any ideas?  I was hoping to use the Cascading JDBC and Hive components to transfer data between a relational database and Hadoop/Hive, rather than using Sqoop or Sqoop2.  However, these issues make me somewhat concerned that the Cascading Hive components are not production-ready.


Below are the configuration parameters I am setting -- referencing the relevant hadoop/hive configuration files.

Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/hive-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/mapred-site.xml"));
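If the client is not picking up hive-site.xml, another option is to set the metastore location explicitly in the properties handed to the flow connector. This is a minimal sketch, not code from the thread: the host name is a placeholder, and 9083 is only the conventional Thrift port for the Hive metastore.

```java
import java.util.Properties;

public class MetastoreProps {

    // Build client-side properties pointing at the remote metastore so the
    // job does not fall back to an embedded Derby one. The host is a
    // placeholder; adjust to your metastore service.
    static Properties metastoreProps(String thriftHost) {
        Properties props = new Properties();
        props.setProperty("hive.metastore.uris",
                          "thrift://" + thriftHost + ":9083");
        return props;
    }

    public static void main(String[] args) {
        Properties props = metastoreProps("metastore-host.example.com");
        // These properties would then be passed to the flow connector,
        // e.g. new HadoopFlowConnector(props).
        System.out.println(props.getProperty("hive.metastore.uris"));
    }
}
```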






Andre Kelpe

unread,
May 6, 2015, 11:24:36 AM5/6/15
to cascadi...@googlegroups.com
Hi Neil,

Can you try the latest wip-1.1 version of Cascading Hive? I believe this is a problem I have fixed, where it would not pass the metastore information from the client side on to the cluster. If you cannot use that version, make sure the metastore is configured cluster side in yarn-site.xml, like this:

  <property>
      <name>hive.metastore.uris</name>
      <value>thrift://<hostname>:9083</value>
  </property>

- André

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/d4c3526f-428d-4dff-ba69-e3e0f6131296%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Neil Barton

unread,
May 6, 2015, 3:29:54 PM5/6/15
to cascadi...@googlegroups.com
Thanks for the response.

I used hive-1.1.0-wip-19: moved the jar file up onto HDFS, included it in the DistributedCache, and then reran the same job.

On the positive side, the table was created and loaded, and the metadata was updated in the shared MySQL Hive repository. So that was good.

I did note that it still created a Derby metastore_db directory and an associated derby.log file in the directory from which I started Tomcat.  A bit odd.

Cheers,


Andre Kelpe

unread,
May 7, 2015, 12:09:37 PM5/7/15
to cascadi...@googlegroups.com
Putting the jar cluster side should not be necessary, but I am glad it works for you now.

The metastore_db thing seems to happen by itself; at least I have not found a way to turn it off. I assume the Hive shell handles this somehow, so we should copy that behaviour. Feel free to take a shot at that; we are more than happy to take contributions.

- André



sam...@gmail.com

unread,
Oct 1, 2015, 11:03:30 AM10/1/15
to cascading-user
Hi Andre,

Can we turn off the metastore_db and Derby log creation when running the job? I am facing a "directory cannot be created" issue while running the jar: I run it under the batch ID, which does not have permission to create the directory. Even just changing the location to some temp folder would work for me.



thank you.

Andre Kelpe

unread,
Oct 1, 2015, 12:05:14 PM10/1/15
to cascading-user
If you are using a cluster, you have to have a remote metastore, otherwise you will run into inconsistencies. It will not work correctly without one.

That being said, if you want to change it for a test, you can do this:

System.setProperty( "derby.system.home",  "/tmp/derby" );
System.setProperty( "derby.stream.error.file", "/dev/null" );
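For these settings to take effect, they must be applied before the first class touches Derby. A minimal standalone sketch (the paths are illustrative, not from the thread):

```java
public class DerbyQuiet {

    public static void main(String[] args) {
        // Set these as early as possible, e.g. in the webapp's startup
        // code, before any metastore/Derby class is loaded.
        System.setProperty("derby.system.home", "/tmp/derby");
        System.setProperty("derby.stream.error.file", "/dev/null");
        System.out.println(System.getProperty("derby.system.home"));
    }
}
```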


- André





sam...@gmail.com

unread,
Oct 1, 2015, 12:55:04 PM10/1/15
to cascading-user
Thank you, Andre. I am using a cluster. Could I have some more details on setting up a remote metastore?

sam...@gmail.com

unread,
Oct 2, 2015, 9:57:34 AM10/2/15
to cascading-user
Hi Andre,

I am using a remote metastore now, but the issue is with folder permissions. Our process deploys the jar to a folder where we do not have permission to create new folders; we can only execute the job from there. So I am facing an issue creating the folder at the jar location. Is there a way to change the path, instead of using the job's current working directory?
I really appreciate your help on this.

Thank You

Andre Kelpe

unread,
Oct 2, 2015, 12:09:02 PM10/2/15
to cascading-user
If the remote metastore is correctly configured, it should not create any directories. We might not read hive-site.xml by default, but if you put the metastore config into yarn-site.xml it will work. Note that it is necessary to have this correctly configured on both the client and the cluster, since the client writes to the metastore and individual mappers/reducers do as well (in the case of a partitioned Hive table).

- André



Arshad Ali Sayed

unread,
Aug 30, 2016, 7:10:23 AM8/30/16
to cascading-user
Hello Andre,

I am facing the same issue with the creation of the metastore_db folder. I have tried executing cascading-hive on a cluster where a remote metastore is configured using MySQL. Whenever I run a cascading-hive job, it creates a metastore_db folder inside the directory from which I launched the job.

I tried the following ways to get rid of this, after referring to the discussion above:
1) Putting the metastore config into yarn-site.xml. 
2) Adding hive-site.xml as a resource in Configuration object.

In case 1 it was able to create the table, but the remote metastore didn't have the table entry.
In case 2 the remote metastore was updated with the table metadata.
In both cases the metastore_db folder was created.

Could you please help me to solve this problem? Do I need to add any code changes/properties to avoid creation of the metastore_db directory?

Note: In my case, running the Hive CLI doesn't create a metastore_db folder if a remote metastore is configured, which is the correct behaviour.

Thanks
Arshadali

Andre Kelpe

unread,
Aug 30, 2016, 7:58:35 AM8/30/16
to cascading-user
Did you configure the remote metastore on the machine where you launch the job from, or only cluster side? It has to be on both.

- André