Load and Index data from HDFS


karteek chada

Jul 21, 2015, 4:53:15 PM
to Druid User
Hello Druid Guru's,

I have a few questions before I load and index data from HDFS. (I did try an example and it failed; I'm sure I missed a few steps before running the task, hence the questions below.)

  1. Do I have to add or modify any config or properties files in the Druid installation to provide the path or connectivity from Druid to HDFS?
  2. Are there any other environment settings I need to be aware of, beyond question 1?

Regards
karteek

Gian Merlino

Jul 21, 2015, 5:11:58 PM
to druid...@googlegroups.com
Hey Karteek, if you're not doing it already, try including your Hadoop jars and Hadoop config XMLs on Druid's classpath. Also, if you're using HDFS for deep storage (as opposed to just using Hadoop for indexing) then make sure to include the druid-hdfs-storage module as one of your extensions and set druid.storage.type=hdfs on all your Druid nodes.
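As a concrete sketch of Gian's suggestion: the Hadoop config XMLs and jars just need to appear on the classpath when the node is launched. The paths below (a Hadoop install under /usr/local/hadoop with configs in etc/hadoop) are assumptions; adjust them to your installation.

```shell
#! /bin/bash
# Hedged sketch: HADOOP_CONF_DIR and jar locations are assumptions,
# adjust to wherever your Hadoop distribution actually lives.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
HADOOP_CP="${HADOOP_CONF_DIR}:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs/*"

# Launch a Druid node with the Hadoop configs and jars on its classpath.
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "config/_common:config/historical:lib/*:${HADOOP_CP}" \
  io.druid.cli.Main server historical
```

The key point is that core-site.xml and hdfs-site.xml must be visible to Druid's JVM, otherwise it cannot resolve the hdfs:// scheme.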

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/8f31fdb2-b838-430e-92b8-2aa87fb552e7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jz shen

Aug 24, 2015, 3:53:03 AM
to Druid User
Thanks Gian, glad to see your answer to this question. I've been confused about how to use HDFS for my deep storage. I used the local file system as deep storage and it worked well; then I tried to replace it with HDFS (I have a Hadoop cluster). I edited the config file to set the storage type to hdfs and provided a path such as "hdfs://<myHostName>:9000/<path-to-myDir>", booted Druid, and ran start-dfs.sh to boot HDFS. However, it didn't store my data in HDFS; instead, it created a recursion of local directories named hdfs:, <myHostName>:9000, <path-to-myDir>. I have to admit that I didn't include my Hadoop jars and config XMLs on Druid's classpath. Could you please tell me how to do that, so I can successfully set up deep storage using HDFS? Thanks.

On Wednesday, July 22, 2015 at 5:11:58 AM UTC+8, Gian Merlino wrote:

Fangjin Yang

Aug 24, 2015, 9:52:40 PM
to Druid User
Hi Jz,

You'll need to do 3 things to use HDFS for deep storage.

1) Include the HDFS extension (druid-hdfs-storage) in your list of extensions.
2) Set the deep storage configs (druid.storage.type and druid.storage.storageDirectory) to point to HDFS.
3) Include the relevant Hadoop configuration files on the classpath of the nodes you are using.

If this doesn't work, can you share your ingestion spec?
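In common.runtime.properties form, the first two steps might look like the following sketch for Druid 0.8.x (the hostname and path are placeholders, and the extension coordinate assumes the Maven-coordinate syntax that 0.8.x used):

```properties
# 1) Load the HDFS storage extension
druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage:0.8.0"]

# 2) Point deep storage at HDFS (namenode host/port and path are placeholders)
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://<namenode-host>:9000/<path-to-segments>
```

Step 3 is not a properties setting: the Hadoop config XMLs (core-site.xml, hdfs-site.xml) go on each node's JVM classpath in the launch command.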

Jz shen

Aug 25, 2015, 1:44:28 AM
to Druid User
Hi, Fangjin,

Thanks a lot. What are the "proper configs" in step 2)?

Besides, I did steps 1) and 3), but the Druid realtime node throws "No FileSystem for scheme: hdfs". I guess I must have set the Hadoop configuration on the classpath the wrong way. Below is part of my _common/common.runtime.properties and the command line I use to run the historical node (I copied all of the Hadoop XMLs into a directory called hadoopConf under druid/config):

config file:
druid.extensions.coordinates=["io.druid.extensions:druid-examples","io.druid.extensions:druid-kafka-eight","io.druid.extensions:mysql-metadata-storage","io.druid.extensions:druid-hdfs-storage:0.8.0"]
# Zookeeper
druid.zk.service.host=slaver2

# Metadata Storage (mysql)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid
# Deep storage (local filesystem for examples - don't use this in production)
###
#
# storageType = local
#
#druid.storage.type=local
#druid.storage.storageDirectory=/usr/local/mrshen/druid-0.8.0/localStorage
#
#
##

###
#
# storageType = hdfs
#
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://slaver2:9000/mrshen

how I boot the historical node:
#! /bin/bash
HADOOP_OPTS=config/hadoopConf:/usr/local/hadoop/lib/*:/usr/local/share/common/*:/usr/local/hadoop/share/hdfs/*:/usr/local/hadoop/share/httpfs/*:/usr/local/hadoop/share/kms/*:/usr/local/hadoop/mapreduce/*:/usr/local/hadoop/tools/*:/usr/local/hadoop/share/yarn/*
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath config/_common:config/historical:lib/*:${HADOOP_OPTS} io.druid.cli.Main server historical

and below is the log from the realtime node:
2015-08-25T10:58:01,224 ERROR [wikipedia-2015-08-25T10:50:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[wikipedia]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2015-08-25T10:50:00.000Z/2015-08-25T10:55:00.000Z}
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304) ~[?:?]
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311) ~[?:?]
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) ~[?:?]
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) ~[?:?]
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) ~[?:?]
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) ~[?:?]
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) ~[?:?]
at io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:83) ~[?:?]
at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:456) [druid-server-0.8.0.jar:0.8.0]
at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.8.0.jar:0.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [?:1.7.0_51]
2015-08-25T10:58:01,237 INFO [wikipedia-2015-08-25T10:50:00.000Z-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-08-25T10:58:01.237Z","service":"realtime","host":"localhost:8084","severity":"component-failure","description":"Failed to persist merged index[wikipedia]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.IOException","exceptionMessage":"No FileSystem for scheme: hdfs","exceptionStackTrace":"java.io.IOException: No FileSystem for scheme: hdfs\n\tat org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304)\n\tat org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311)\n\tat org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)\n\tat org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)\n\tat org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)\n\tat org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)\n\tat org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)\n\tat io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:83)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:456)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2015-08-25T10:50:00.000Z/2015-08-25T10:55:00.000Z"}}]
2015-08-25T11:00:00,250 INFO [chief-wikipedia[0]] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2015-08-25T11:00:00.000Z_2015-08-25T11:05:00.000Z_2015-08-25T11:00:00.000Z] at path[/druid/segments/localhost:8084/localhost:8084_realtime__default_tier_2015-08-25T10:48:44.407Z_9aa0cac5257a4091b8e0cfa8b4d050f70]
 
On Tuesday, August 25, 2015 at 9:52:40 AM UTC+8, Fangjin Yang wrote:

Jz shen

Aug 25, 2015, 3:32:10 AM
to Druid User
Hi, Fangjin,

Thanks for your help, I got it working!

As for the "No FileSystem for scheme: hdfs" exception, it was caused by a missing setting for fs.hdfs.impl in Hadoop's core-site.xml. I added it to core-site.xml and now it works!
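For reference, the setting Jz is describing is the standard Hadoop property that maps the hdfs:// scheme to its FileSystem implementation class; it usually looks like this in core-site.xml:

```xml
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
```

Declaring it explicitly works around classpath setups where the HDFS jar's filesystem service metadata is not picked up, which produces exactly the "No FileSystem for scheme: hdfs" error above.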


On Tuesday, August 25, 2015 at 9:52:40 AM UTC+8, Fangjin Yang wrote:
Hi Jz,

Fangjin Yang

Aug 26, 2015, 12:41:13 AM
to Druid User
Hi Jz, great to hear you got it working!

shen Jz

Dec 1, 2016, 1:29:11 AM
to druid...@googlegroups.com
Hi FJ,

    These days I tried deploying druid-0.9.1. I successfully started up the coordinator, broker, overlord, and middleManager, loaded the batch example data following the "quickstart" document, and could see "SUCCESS" on the page http://localhost:8090/console.html.
    However, when I tried to test streaming data using the Tranquility tool as the document describes, an exception occurred as follows:

    I looked for id.druid.service.jar under tranquility/lib and indeed there was no such file. What's more, I copied the other .jar files, including the one containing the AbstractTask class, under ${druid.home}/lib, but it still didn't work.

    Please help and give me some advice if you have any idea about it, thanks a lot.

Yours



David Lim

Dec 15, 2016, 7:13:14 PM
to Druid User
Hey Jz, have you figured this out yet? It looks like you're trying to include the kafka-indexing-service extension in Tranquility, which is not supported. The Kafka indexing service is independent from Tranquility, and the extension should be loaded onto the overlord and middle manager nodes.
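A minimal sketch of what David describes, assuming the druid.extensions.loadList syntax that Druid 0.9.x uses: the extension is enabled in the common.runtime.properties read by the overlord and middleManager, not placed into Tranquility's lib/.

```properties
# On the overlord and middleManager nodes (not in Tranquility's lib/):
druid.extensions.loadList=["druid-kafka-indexing-service"]
```

Tranquility and the Kafka indexing service are two separate streaming ingestion paths; only one of them is involved in any given pipeline.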

shen Jz

Dec 23, 2016, 12:51:45 AM
to druid...@googlegroups.com
Hi David,

    Finally I worked it out by replacing Tranquility with a version-0.8.0 one; it seems to have been a version compatibility problem.
    Thanks for your attention!

yours
Jz
