Azure Hadoop support in Presto


Adam Lacey

Sep 22, 2016, 11:41:36 AM
to Presto
I'm wondering if anyone can point me in the right direction for getting support for querying Hive backed by Azure Storage.

This is possible using the Hadoop Azure filesystem, implemented in the hadoop-azure module and used within Microsoft's HDInsight product: https://hadoop.apache.org/docs/stable2/hadoop-azure/index.html

My use case: I have an HDInsight cluster where data is stored in Azure Storage blobs, and the cluster is configured to use the wasb protocol (the custom hadoop-azure FileSystem implementation linked above) to read and write data.

I would like to use Presto to query the Hive tables defined in this cluster. I configured the Presto hive-hadoop2 catalog to point at the Hive metastore. Additionally, I copied the following JARs for the custom FileSystem into the <presto install>/plugin/hive-hadoop2 directory:

- azure-storage-4.4.0.jar
- hadoop-azure-2.7.3.jar

I can successfully query the metastore (SHOW DATABASES, SHOW TABLES, etc... work fine).

Querying data, e.g.

SELECT * FROM hive.default.hivesampletable;

always returns 0 rows, even though I have verified with the Hive command-line client that the query does in fact return data.

Any thoughts or things I could troubleshoot?

thanks!

Dain Sundstrom

Sep 22, 2016, 12:38:04 PM
to presto...@googlegroups.com
The metadata commands are implemented exclusively using Hive metastore RPCs, whereas the table scan code needs access to the filesystem. My guess is you need some configuration information (e.g., username and password) to access the storage system, but I would also expect to see some exceptions. Alternatively, it could be that the file listing code in the filesystem is returning no files (you can see this in the UI because the table scan stage will have no tasks).

-dain

Adam Lacey

Sep 22, 2016, 1:17:45 PM
to presto...@googlegroups.com
Thanks for the reply, Dain.

Yes, I think the configuration is correct: I have explicitly listed the config files that provide the right Azure storage account name and keys in the hive.config.resources property of the hive.properties file.
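For reference, the relevant part of my catalog file looks roughly like this (the metastore URI is a placeholder; the config paths are the ones on my cluster, yours may differ):

```properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<metastore-host>:9083
hive.config.resources=/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/core-site.xml
```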

How would you recommend I troubleshoot this? Are the server log files the best place to see what's happening, or would I need to use a debugger to start digging?

thanks.


Dain Sundstrom

Sep 22, 2016, 1:46:08 PM
to presto...@googlegroups.com
You can split the potential problem space by checking whether you are getting any splits. If not, there is a problem in the file listing; otherwise, there is a problem reading the files. You can verify this by looking in the UI to see whether you get tasks in the table scan stage. My guess is the file listing is the problem.

After that, I'd personally use a debugger and put a breakpoint in com.facebook.presto.hive.BackgroundHiveSplitLoader anywhere getFileBlockLocations is called. I'd guess either you are getting no results, or all the files are getting filtered out (do they start with `.` or `_`?), or the files are in nested directories (in which case you need to enable "hive.recursive-directories").
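To make the hidden-file rule concrete, here is a minimal sketch in Python (not Presto's actual code, just the same name-based rule; the example listing is made up):

```python
def visible(name: str) -> bool:
    """Mimic the Hive connector's hidden-file rule: file names starting
    with '.' or '_' are skipped by the split loader."""
    return not name.startswith((".", "_"))

# Example: which files in a table directory would actually be scanned?
listing = ["part-00000", "_SUCCESS", ".hidden", "part-00001"]
scanned = [f for f in listing if visible(f)]
print(scanned)  # ['part-00000', 'part-00001']
```

If every file in the table directory is a `_SUCCESS`-style marker or starts with a dot, the table scan will return zero rows with no error, which matches the symptom.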

-dain

Adam Lacey

Sep 26, 2016, 5:09:50 PM
to presto...@googlegroups.com
I'm testing with just the Presto CLI. I'm not exactly sure where to look in a log file to see what's happening in the different stages, but this is what is output in the Presto server.log when I run my query:


java.lang.NoClassDefFoundError: org/mortbay/util/ajax/JSON$Convertor
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.createDefaultStore(NativeAzureFileSystem.java:1064)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1035)
        at org.apache.hadoop.fs.PrestoFileSystemCache.createFileSystem(PrestoFileSystemCache.java:74)
        at org.apache.hadoop.fs.PrestoFileSystemCache.getInternal(PrestoFileSystemCache.java:61)
        at org.apache.hadoop.fs.PrestoFileSystemCache.get(PrestoFileSystemCache.java:43)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at com.facebook.presto.hive.HdfsEnvironment.lambda$getFileSystem$0(HdfsEnvironment.java:67)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:66)
        at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:60)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:280)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:222)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:77)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:178)
        at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:45)
        at com.facebook.presto.hive.util.ResumableTasks.lambda$submit$1(ResumableTasks.java:33)
        at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.mortbay.util.ajax.JSON$Convertor
        at com.facebook.presto.server.PluginClassLoader.loadClass(PluginClassLoader.java:106)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 21 more

2016-09-26T21:04:40.648Z        INFO    query-execution-2       com.facebook.presto.event.query.QueryMonitor    TIMELINE: Query 20160926_210438_00000_7k8yi :: Transaction:[c9010f6c-308c-43cd-a759-5b905e9eb2ad] :: elapsed 1641ms :: planning 708ms :: scheduling 933ms :: running 0ms :: finishing 933ms :: begin 2016-09-26T21:04:38.692Z :: end 2016-09-26T21:04:40.333Z


This is logged as WARN, although it may indicate a more serious issue. Could this be the root cause of why no data is returned?

Adam Lacey

Sep 26, 2016, 5:52:36 PM
to presto...@googlegroups.com
After some poking around and copying more JAR files from HDInsight to the /plugin/hive-hadoop2 folder, I am stuck on this exception in server.log:

2016-09-26T21:50:13.942Z WARN hive-hive-0 com.facebook.presto.hive.util.ResumableTasks ResumableTask completed exceptionally
java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
        at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
        at org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider.getStorageAccountKey(ShellDecryptionKeyProvider.java:40)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:841)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:921)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:439)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1160)
        at org.apache.hadoop.fs.PrestoFileSystemCache.createFileSystem(PrestoFileSystemCache.java:74)
        at org.apache.hadoop.fs.PrestoFileSystemCache.getInternal(PrestoFileSystemCache.java:61)
        at org.apache.hadoop.fs.PrestoFileSystemCache.get(PrestoFileSystemCache.java:43)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at com.facebook.presto.hive.HdfsEnvironment.lambda$getFileSystem$0(HdfsEnvironment.java:67)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:66)
        at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:60)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:280)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:222)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:77)
        at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:178)
        at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:45)
        at com.facebook.presto.hive.util.ResumableTasks.lambda$submit$1(ResumableTasks.java:33)
        at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

2016-09-26T21:50:14.163Z INFO query-execution-2 com.facebook.presto.event.query.QueryMonitor TIMELINE: Query 20160926_215012_00000_ukb8u :: Transaction:[381ea87a-1958-4ec6-93b3-cce079ec085b] :: elapsed 1578ms :: planning 855ms :: scheduling 723ms :: running 0ms :: finishing 723ms :: begin 2016-09-26T21:50:12.445Z :: end 2016-09-26T21:50:14.023Z


I can see that the hadoop-common.jar in the plugin folder does implement this method. I'm not sure where else the Presto server could be loading a JAR that provides that class from... any ideas?
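One way to find out which JARs actually bundle a class is to scan the plugin directory; a small Python sketch (the plugin path in the comment is hypothetical, substitute your own install directory):

```python
import zipfile
from pathlib import Path

def jars_containing(plugin_dir: str, class_name: str) -> list[str]:
    """Report which JARs under plugin_dir bundle the given class.

    Package dots map to '/' and a '.class' suffix is appended, matching
    how class files are stored inside a JAR.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(plugin_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits

# Hypothetical usage against a Presto plugin directory:
# jars_containing("/opt/presto/plugin/hive-hadoop2",
#                 "org.apache.hadoop.security.ProviderUtils")
```

If more than one JAR shows up, the class loader may be picking an older copy that lacks the method, which would explain the NoSuchMethodError.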


Ankit Dixit

Sep 28, 2016, 9:22:31 AM
to Presto, Adam Lacey
Hi Adam
The root cause of that particular error is that the shaded hadoop-apache2 artifact builds on Hadoop 2.7.x, while the method excludeIncompatibleCredentialProviders was added in 2.8.
After trying multiple combinations, I was able to run SELECT * and INSERT commands using the following JAR versions in plugin/hive-hadoop2:
  • hadoop-azure-2.7.3.jar
  • hadoop-apache2-0.8.jar
  • azure-storage-3.1.0.jar
Regards
Ankit

Adam Lacey

Sep 28, 2016, 1:48:56 PM
to presto...@googlegroups.com
Thanks Ankit - your reply was vital to me figuring this out.

You are right, the root cause is mismatched transitive dependencies between Presto's packaged hadoop-apache2*.jar and HDInsight's hadoop-azure*.jar and azure-storage*.jar.

Here's what I had to do for Presto to work with an HDInsight 3.4 cluster:

In the <install dir>/plugin/hive-hadoop2 directory, install the following files:
  • hadoop-azure-2.7.3.jar - note this is newer than the hadoop-azure JAR installed on HDInsight 3.4 nodes; I downloaded it from Maven Central
  • hadoop-apache2-0.8.jar - the latest version of Presto's shaded hadoop-apache JAR, downloaded from a Maven repo; remove the old hadoop-apache2-0.6.jar
  • azure-storage-4.4.0.jar - the latest version of the Azure Storage SDK, downloaded from a Maven repo
  • jetty-util-6.1.25.jar - needed to resolve the missing JSON$Convertor class used by the HDInsight hadoop-azure JAR; downloaded from Maven Central
  • commons-lang-2.6.jar - needed to resolve the missing StringUtils class used by the HDInsight hadoop-azure JAR; copy it from an HDInsight cluster node at /usr/hdp/current/hadoop-client/lib/commons-lang-2.6.jar
Follow the tutorial for setting up the Hive connector, and add the following setting at the end of your hive.properties:

hive.config.resources=/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/core-site.xml

Selecting data from Hive tables now works!

I'd like to hear what the Presto team thinks of packaging HDInsight support in the future. It seems the right way to do this is to implement a new plugin (e.g. 'hive-hdi34' or similar) and package up the needed JARs. Ideally, Microsoft would improve its packaging and dependencies to something saner...

I hope this helps someone!

Ankit Dixit

Oct 18, 2016, 7:17:46 AM
to Presto, Adam Lacey
Using azure-storage-4.4.0.jar will let you run SELECT * from tables, but INSERT will fail. azure-storage-3.1.0.jar works for both SELECT and INSERT operations.

Regards
Ankit

David Phillips

Oct 18, 2016, 11:24:54 AM
to Presto, Adam Lacey
Thanks for the detailed info! We're happy to include support for Azure. I think the right way is to include the hadoop-azure dependency (from the Hadoop project) in the shaded hadoop-apache2 artifact.

There's an existing pull request to do that, but it needs some work:


It would also be nice to have an integration test for this that could be run after supplying appropriate Azure credentials.
Mithun Arunan

Feb 14, 2018, 2:04:29 AM
to Presto
Hey Adam, thanks. It's working fine, but inserts into a Presto table with partitions are failing. These are the JARs I'm using:

hadoop-azure-2.7.3.jar
azure-storage-4.4.0.jar
jetty-util-6.1.25.jar
commons-lang-2.6.jar
hadoop-apache2-2.7.3-1.jar instead of hadoop-apache2-0.8.jar

presto> insert into wasb.testdb.test5 select * from wasb.testdb.test4;
INSERT: 1 row

Query 20180214_065656_00037_p5wpj, FAILED, 5 nodes
http://localhost:8080/query.html?20180214_065656_00037_p5wpj
Splits: 86 total, 86 done (100.00%)
CPU Time: 0.1s total, 8 rows/s, 2.38KB/s, 26% active
Per Node: 0.0 parallelism, 0 rows/s, 66B/s
Parallelism: 0.1
0:01 [1 rows, 305B] [1 rows/s, 330B/s]

Query 20180214_065656_00037_p5wpj failed: org/codehaus/jackson/map/JsonMappingException
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.prepareAtomicFolderRename(NativeAzureFileSystem.java:2111)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1996)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.renameDirectory(SemiTransactionalHiveMetastore.java:1628)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.access$2500(SemiTransactionalHiveMetastore.java:78)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore$Committer.prepareAddPartition(SemiTransactionalHiveMetastore.java:1052)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore$Committer.access$700(SemiTransactionalHiveMetastore.java:887)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.commitShared(SemiTransactionalHiveMetastore.java:817)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.commit(SemiTransactionalHiveMetastore.java:739)
at com.facebook.presto.hive.HiveMetadata.commit(HiveMetadata.java:1517)
at com.facebook.presto.hive.HiveConnector.commit(HiveConnector.java:177)
at com.facebook.presto.transaction.TransactionManager$TransactionMetadata$ConnectorTransactionMetadata.commit(TransactionManager.java:577)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at com.facebook.presto.server.PluginClassLoader.loadClass(PluginClassLoader.java:76)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 19 more


