Alluxio with Hive on EMR


Dan

Jan 20, 2017, 3:08:10 PM
to Alluxio Users
All,

Some quick notes about running Hive on Alluxio:

My environment is as follows:
AWS EMR Cluster
EMR Version: emr-5.1.0
EMR Applications: Hive, Ganglia, Spark
Master Instance: r3.xlarge
Core Instances: r3.xlarge (4x)
Alluxio Version: 1.5.0-SNAPSHOT
Hadoop Version: 2.7.3-amzn-0
OS: Amazon Linux (based off of RHEL)
Java Version: openjdk version "1.8.0_111"

- On EMR, protobuf 2.5.0 is used, which conflicted with the client JAR below, which pulls in 2.6.1. In the main POM, protobuf.version is a property and can be set; I had to change the child POMs so that the version set there (2.5.0, inherited from the main POM) is actually used. I built the project with:
mvn clean install -Pyarn -DskipTests=true -Dhadoop.version=2.7.3 -Djava.version=1.8
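For reference, the change in each child POM (e.g. core/client/pom.xml) was along these lines, so the version is inherited from the parent rather than hard-coded (a sketch of my local edit, not the upstream source):

<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <!-- inherit protobuf.version from the parent POM instead of pinning 2.6.1 -->
  <version>${protobuf.version}</version>
</dependency>

With that in place, the property can also be overridden on the command line, e.g. by appending -Dprotobuf.version=2.5.0 to the mvn invocation above.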
- The path /<PATH_TO_ALLUXIO>/client/hadoop/alluxio-community-1.3.0-hadoop-client.jar doesn't exist and I couldn't find that JAR anywhere. Instead, I used HADOOP_CLASSPATH="/<PATH_TO_ALLUXIO>/core/client/target/*:${HADOOP_CLASSPATH}"
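To sanity-check that the client JARs actually end up on Hadoop's classpath (a quick check; assumes the hadoop launcher is on PATH):

hadoop classpath | tr ':' '\n' | grep -i alluxio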
- On EMR, HIVE_HOME is /usr/lib/hive
- Copy /<PATH_TO_ALLUXIO>/core/client/target/*.jar to $HIVE_HOME
- Hive seems to expect HIVE_INSTALL_DIR to be set; set this to $HIVE_HOME
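Putting the last three notes together, the per-node setup is roughly (a sketch; adjust paths to your build):

export HIVE_HOME=/usr/lib/hive
export HIVE_INSTALL_DIR=$HIVE_HOME
# copy the built Alluxio client JARs where Hive can see them, per the note above
cp /<PATH_TO_ALLUXIO>/core/client/target/*.jar $HIVE_HOME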
- Add the following Classifications to your EMR setup (represented in CF):
         {
            "Classification": "hadoop-env",
            "Configurations": [
              {
                "Classification": "export",
                "ConfigurationProperties": {
                  "HADOOP_CLASSPATH" : "\"/<PATH_TO_ALLUXIO>/core/client/target/*:${HADOOP_CLASSPATH}\"",
                  "HIVE_INSTALL_DIR" : "/usr/lib/hive"
                },
                "Configurations": [ ]
              }
            ]
          },
          {
            "Classification" : "hive-site",
            "ConfigurationProperties" : {
              "fs.defaultFS" : "alluxio://localhost:19998"
             },
            "Configurations" : [ ]
          },
          {
            "Classification" : "core-site",
            "ConfigurationProperties" : {
              "fs.alluxio.impl" : "alluxio.hadoop.FileSystem",
              "fs.AbstractFileSystem.alluxio.impl" : "alluxio.hadoop.AlluxioFileSystem"
             },
            "Configurations" : [ ]
          }
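For reference, the same classifications can be supplied when creating the cluster with the AWS CLI instead of CloudFormation (a sketch; the JSON file name is illustrative and other required flags are omitted):

aws emr create-cluster \
  --release-label emr-5.1.0 \
  --applications Name=Hive Name=Ganglia Name=Spark \
  --instance-type r3.xlarge --instance-count 5 \
  --configurations file://alluxio-classifications.json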
- Note: alluxio://localhost:19998 should probably be the actual hostname of the master instance. I haven't yet found a "global" property I could derive this from.
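One way to derive the hostname at bootstrap time is the EC2 instance metadata service (a sketch; assumes the standard metadata endpoint, which is reachable by default on EMR nodes, and that this runs on the master):

MASTER_HOST=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
# substitute into the hive-site classification, e.g.:
#   fs.defaultFS = alluxio://${MASTER_HOST}:19998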
- ${HIVE_HOME}/bin/schematool -initSchema -dbType derby
    * I haven't been able to get this step to work on EMR; I get the following error:

Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"APP"."NUCLEUS_ASCII" (C CHAR(1)) RETURNS INTEGER LANGUAGE JAVA PARAMETER STYLE ' at line 1

Query is:
CREATE FUNCTION "APP"."NUCLEUS_ASCII" (C CHAR(1)) RETURNS INTEGER LANGUAGE JAVA PARAMETER STYLE JAVA READS SQL DATA CALLED ON NULL INPUT EXTERNAL NAME 'org.datanucleus.store.rdbms.adapter.DerbySQLFunction.ascii' (state=42000,code=1064)

org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2

- CREATE TABLE example does work
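For example (a sketch; the table name and path are made up, and the explicit LOCATION is optional once fs.defaultFS points at Alluxio):

CREATE TABLE test_alluxio (id INT, name STRING)
LOCATION 'alluxio://<MASTER_HOST>:19998/user/hive/warehouse/test_alluxio';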
- I get the following error when I launch the Hive CLI. I think there's a property missing:

ERROR tez.TezSessionState: Failed to start Tez session
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///apps/tez/tez.tar.gz
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.tez.client.TezClientUtils.addLocalResources(TezClientUtils.java:217)
        at org.apache.tez.client.TezClientUtils.setupTezJarsLocalResources(TezClientUtils.java:183)
        at org.apache.tez.client.TezClient.getTezJarResources(TezClient.java:1057)
        at org.apache.tez.client.TezClient.start(TezClient.java:447)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.startSessionAndContainers(TezSessionState.java:390)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.access$000(TezSessionState.java:96)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState$1.call(TezSessionState.java:327)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState$1.call(TezSessionState.java:323)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:745)

- I get the following warning when I exit the Hive CLI:

WARN util.ShutdownHookManager: ShutdownHook 'ClientFinalizer' failed, java.lang.NullPointerException
java.lang.NullPointerException
        at alluxio.client.file.FileSystemContext.acquireMasterClient(FileSystemContext.java:228)
        at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:178)
        at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:172)
        at alluxio.hadoop.AbstractFileSystem.getFileStatus(AbstractFileSystem.java:321)
        at alluxio.hadoop.FileSystem.getFileStatus(FileSystem.java:25)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
        at org.apache.hadoop.fs.FileSystem.processDeleteOnExit(FileSystem.java:1409)
        at org.apache.hadoop.fs.FileSystem.close(FileSystem.java:2070)
        at alluxio.hadoop.AbstractFileSystem.close(AbstractFileSystem.java:134)
        at alluxio.hadoop.FileSystem.close(FileSystem.java:25)
        at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2760)
        at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2777)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)


Questions:
Should the protobuf-java module have a parameterized version (<version>${protobuf.version}</version>) in all POMs (e.g., core/client/pom.xml)?
Any ideas on what's causing the above errors?

Thanks,
Dan

Boe H

Jan 24, 2017, 11:30:11 PM
to Alluxio Users
Any luck? Would be great to hear of a successful deployment on EMR.

黄志

Jan 25, 2017, 5:36:41 AM
to Alluxio Users

Can you provide all the properties in hive-site.xml? You initialized Hive with dbType = derby, but the error complains about MySQL syntax. And the Tez error suggests your HDFS URI is incomplete (I think it should be hdfs://host:port, not hdfs:///).

On Saturday, January 21, 2017 at 4:08:10 AM UTC+8, Dan wrote:

Dan

Jan 25, 2017, 8:07:14 PM
to Alluxio Users
Sure.

hive-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Licensed to the Apache Software Foundation (ASF) under one or more       -->
<!-- contributor license agreements.  See the NOTICE file distributed with    -->
<!-- this work for additional information regarding copyright ownership.      -->
<!-- The ASF licenses this file to You under the Apache License, Version 2.0  -->
<!-- (the "License"); you may not use this file except in compliance with     -->
<!-- the License.  You may obtain a copy of the License at                    -->
<!--                                                                          -->
<!--     http://www.apache.org/licenses/LICENSE-2.0                           -->
<!--                                                                          -->
<!-- Unless required by applicable law or agreed to in writing, software      -->
<!-- distributed under the License is distributed on an "AS IS" BASIS,        -->
<!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -->
<!-- See the License for the specific language governing permissions and      -->
<!-- limitations under the License.                                           -->

<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
<!-- that are implied by Hadoop setup variables.                                                -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource).                                                                                 -->

<!-- Hive Execution Parameters -->


  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-253-1-19.ec2.internal</value>
  </property>

  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>

  <property>
    <name>fs.defaultFS</name>
    <value>alluxio://localhost:19998</value>
  </property>


  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://ip-10-253-1-19.ec2.internal:9083</value>
    <description>Thrift URI for the remote metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ip-10-253-1-19.ec2.internal:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.mariadb.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>[REMOVED]</value>
    <description>password to use against metastore database</description>
  </property>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>

  <property>
    <name>mapred.max.split.size</name>
    <value>256000000</value>
  </property>

  <property>
    <name>hive.metastore.connect.retries</name>
    <value>15</value>
  </property>

  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>true</value>
  </property>

</configuration>

John C

Jan 26, 2017, 5:56:13 PM
to Alluxio Users
Was able to get past this by editing the tez.lib.uris value in /etc/tez/tez-site.xml (it had something like hdfs:///apps...).

Apparently Tez is sometimes used by Hive.
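Concretely, the edit was along these lines (a sketch; substitute your master's hostname, and 8020 is the default NameNode RPC port on EMR):

<property>
  <name>tez.lib.uris</name>
  <!-- was hdfs:///apps/tez/tez.tar.gz; the URI needs an explicit host:port
       once fs.defaultFS no longer points at HDFS -->
  <value>hdfs://<MASTER_HOST>:8020/apps/tez/tez.tar.gz</value>
</property>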


On Friday, January 20, 2017 at 3:08:10 PM UTC-5, Dan wrote:

Yufa Zhou

Feb 8, 2017, 11:36:45 PM
to Alluxio Users
Hi Dan,

Have you resolved the problem?

Boe H

Feb 14, 2017, 11:08:30 AM
to Alluxio Users
Hi Dan - Would be great to hear an update.

Also, for anyone who can answer this - I noticed there are two different versions of the JAR file referenced depending on whether you are reading the docs on alluxio.com or alluxio.org:

1) alluxio-community-1.3.0-hadoop-client.jar
2) alluxio-core-client-1.3.0-jar-with-dependencies.jar

What exactly is the difference between these two libraries? I haven't had success with #2 on Hive/EMR but may give #1 a try.

Yufa Zhou

Feb 15, 2017, 1:43:15 AM
to Alluxio Users
Dear Boe,

You have set the connection URL to the MySQL database in hive-site.xml, which means you want to store Hive metadata in MySQL. However, when initializing the Hive schema, you specified -dbType derby, which is inconsistent with hive-site.xml. You can try specifying -dbType mysql.
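That is, something like (assuming schematool is run on a node that can reach the metastore database):

${HIVE_HOME}/bin/schematool -initSchema -dbType mysql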

Yufa Zhou

Feb 28, 2017, 9:23:59 PM
to Alluxio Users
Hi Dan,

Have you resolved the problem?

Best


On Saturday, January 21, 2017 at 4:08:10 AM UTC+8, Dan wrote:

Kumar Gadamsetty

Oct 25, 2018, 11:05:05 AM
to Alluxio Users
Hi,

I'm facing the same problem. What should the value of the tez.lib.uris parameter be replaced with?