Realtime server not able to persist segments to HDFS deep storage


Narayanan K

Jun 20, 2014, 1:02:05 PM
to druid-de...@googlegroups.com
Hi,

I installed a Druid realtime node and started the server with a Kafka firehose. Everything is running fine, except that it is not able to persist the segments it receives to deep storage.

Our deep storage is HDFS.

Druid Version : 0.6.105
Hadoop Version : hadoop-1.0.2

runtime.properties
--------------------------


druid.pusher.hdfs=true
druid.pusher.hdfs.storageDirectory=hdfs://druid1.xyz.com:9100/druid06/deepstorage

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://druid1.xyz.com:9100/druid06/deepstorage

druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight:0.6.105","io.druid.extensions:druid-hdfs-storage:0.6.105"]


-----

The exception in the logs:

14/06/20 16:42:36 ERROR RealtimePlumber: Failed to persist merged index [realtime_17]: 
{class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2014-06-20T12:00:00.000Z/2014-06-20T13:00:00.000Z}
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:77)
        at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:349)
        at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I also tried adding the Hadoop client jar to the classpath. It still throws this exception, and I think it ends up persisting the segments to a local directory instead.

Is the druid-hdfs-storage extension for 0.6.105 not compatible with our Hadoop version?

Could you please let me know what is missing here?


Thanks
Narayan




Fangjin Yang

Jun 20, 2014, 1:21:17 PM
to druid-de...@googlegroups.com
Hi Narayanan, Hadoop dependencies are always fun. How are you including the 1.0.2 version of Hadoop with Druid? Is the HDFS storage module you are using from stock Druid?

Thanks,
FJ

Deepak Jain

Jun 20, 2014, 2:03:28 PM
to druid-de...@googlegroups.com
Hello Narayan,
I have seen similar errors in the past.
Run "hadoop classpath" and include all the paths it prints in the classpath when starting the Java process. Do share the full command you used.
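
For example, a rough sketch of what that could look like (the config directories and spec file here are illustrative, not your actual setup):

# append everything that the 'hadoop classpath' command prints to the Druid classpath
java -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -Ddruid.realtime.specFile=config/realtime/realtime.spec \
  -classpath "config/realtime:lib/*:$(hadoop classpath)" \
  io.druid.cli.Main server realtime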
Regards,
Deepak

Narayanan K

Jun 20, 2014, 2:23:36 PM
to druid-de...@googlegroups.com
Hi Deepak,

Will try that out and let you guys know.

I see Druid 0.6 uses Hadoop 2.3. As discussed with Fangjin, a Hadoop dependency mismatch is a problem.

Thanks
Narayanan

Narayanan K

Jun 20, 2014, 4:59:14 PM
to druid-de...@googlegroups.com
Thanks Deepak, that worked.

I included the output of "hadoop classpath" in the realtime node's classpath, and it is now able to write to HDFS.

Narayanan

Punit Shah

Nov 18, 2014, 9:51:26 AM
to druid-de...@googlegroups.com
Hello Folks,

I'm having a similar problem to the one described in this thread, in that I'm getting an IOException: "java.io.IOException: No FileSystem for scheme: hdfs..."

I'm running Hadoop 2.3.0 and Druid 0.6.160.

I've configured HDFS deep storage for both the overlord node and the historical node.

On node launch I can see that the JVM is picking up the Hadoop jars from my local Maven repo.

But when I run the following curl command:
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8087/druid/indexer/v1/task

I run into the same HDFS error you were seeing.

I've tried all sorts of things, including appending jars to the classpath. As an aside, I initially started with Hadoop 2.5.0 and then, after seeing this thread, downgraded to 2.3.0 to match the version packaged with Druid 0.6.160.

I'm having no luck. Not sure if this is important, but I'm running HDFS under a different user account than the Druid service nodes... I'm fairly sure it doesn't matter, because I don't think it's even getting to the point of pushing out a segment.

Any help is greatly appreciated.

-- Punit...

Nishant Bangarwa

Nov 18, 2014, 10:02:29 AM
to druid-de...@googlegroups.com
Hi Punit, 
It looks like you are not adding io.druid.extensions:druid-hdfs-storage as a Druid extension in your runtime.properties.
Can you try adding druid-hdfs-storage as an extension?
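
For reference, the relevant runtime.properties entries would look roughly like this (the 0.6.160 version coordinate and the storage directory are placeholders to adapt to your setup):

druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage:0.6.160"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode-host:8020/druid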

If it still fails, can you share the logs, runtime.properties, and the Java classpath you are using?


Punit Shah

Nov 18, 2014, 10:21:12 AM
to druid-de...@googlegroups.com
Quick response: thanks again.

I'm actually already adding that to the historical and overlord runtime.properties... I see the 0.6.160 version of the io.druid.extensions artifacts being picked up from my local Maven repo...

-- Punit....

Punit Shah

Nov 18, 2014, 12:58:48 PM
to druid-de...@googlegroups.com
Hi Nishant,

I've attached a log file to this post.  

Here is the relevant runtime.properties for Overlord, followed by the command I used to launch it:
druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage:0.6.160"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://127.0.0.1:9000/druid

Here is the command I used to launch Overlord:
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath /home/hduser/hadoop/hadoop-2.3.0/etc/hadoop:lib/*:config/overlord io.druid.cli.Main server overlord


Here is the relevant runtime.properties for the historical node, followed by the command I used to launch it:
druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage:0.6.160"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://127.0.0.1:9000/druid
# Dummy read only AWS account (used to download example data)
#druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
#druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ

Here is the command I used to launch the historical node:
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath /home/hduser/hadoop/hadoop-2.3.0/etc/hadoop:lib/*:config/historical io.druid.cli.Main server historical


Attachment: I68G3K~O.LOG

Punit Shah

Nov 18, 2014, 3:15:05 PM
to druid-de...@googlegroups.com
To those interested, here's an update:

I've managed to get HDFS integration working with Druid 0.6.160 and Hadoop 2.3.0. I changed the runtime.properties value of druid.storage.type to Hdfs as opposed to hdfs.

In my case the capitalization of 'H' seemed to do the trick. It may have been a combination of things, however, culminating in the case-sensitivity change that did it.

Here's a summary of the changes I made:
1. Added the Hadoop *site.xml path to the classpath of the historical and overlord nodes
2. Changed druid.storage.type to Hdfs

I tried adding jars, etc. to classpaths but none of that seemed to matter.

-- Punit...

Fangjin Yang

Nov 18, 2014, 7:56:33 PM
to druid-de...@googlegroups.com
Hi Punit, I believe what is actually happening is that Druid doesn't recognize "Hdfs" and is instead using the local filesystem (which is the default). When you include the Hadoop configuration files with your configuration, how do you verify that HDFS is not working?
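
For example, one quick way to check whether segments are actually landing in HDFS after a handoff is to list the configured deep storage directory (path taken from your runtime.properties):

hdfs dfs -ls hdfs://127.0.0.1:9000/druid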

Punit Shah

Nov 18, 2014, 9:04:02 PM
to druid-de...@googlegroups.com

I attached logs to my previous post; you can take a look at the errors I was seeing. Essentially it was stating that it did not recognize hdfs and erroring out. And now it's using the HDFS path I pointed to in the config file.


Punit Shah

Nov 19, 2014, 9:54:33 PM
to druid-de...@googlegroups.com
Hello Fangjin,

You're right. It's not saving to HDFS. It doesn't seem to save anything at all. For now I'll use local storage, but would you be able to help by looking at the logs I posted yesterday?

Thanks.
-- Punit...

Fangjin Yang

Nov 20, 2014, 1:22:00 AM
to druid-de...@googlegroups.com
I think it is choosing to use local storage for deep storage. I did look through the logs but need some more info.

Do you mind including a copy of your runtime.properties for the overlord and/or middle manager, if you are running one of those?

Punit Shah

Nov 20, 2014, 7:51:57 AM
to druid-de...@googlegroups.com
Hi Fangjin,

Please look at my earlier post... the one I attached the log to. I included the runtime.properties for both historical and overlord there. Thanks in advance.

Punit Shah

Nov 20, 2014, 10:20:07 AM
to druid-de...@googlegroups.com
Upon further examination, it appears the error is stemming from the HdfsDataSegmentPusher class and the org.apache.hadoop.conf.Configuration hadoopConfig object. I'm assuming that it tries to get this object from the Hadoop config path? Is there any particular file it looks for? My core-site.xml looks like:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Punit Shah

Nov 20, 2014, 12:04:35 PM
to druid-de...@googlegroups.com
Hi,

I was using the wrong indexing example task. I was using wikipedia_index_task when I should have been using wikipedia_index_hadoop_task.

After switching I made some more progress... now I'm getting a write access error... but at least it's going through... I believe I can figure this out...
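
In case it helps anyone else, the submission is the same curl call as before with the hadoop task file swapped in (assuming the file sits in the same examples/indexing directory):

curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_hadoop_task.json localhost:8087/druid/indexer/v1/task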

Hao Chen

Jan 10, 2015, 3:37:05 AM
to druid-de...@googlegroups.com

Fangjin Yang

Jan 12, 2015, 1:02:26 PM
to druid-de...@googlegroups.com
Thanks for the contrib! We'll try to get this merged asap

Sai Subramaniam

Sep 1, 2015, 2:55:52 AM
to Druid Development
Greetings.

I am not able to persist the real-time segments to HDFS. I followed this thread, but the data is lost after closing the connection.

My runtime.properties:

# Metadata Storage (mysql)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://127.0.0.1:3306/druid
druid.metadata.storage.connector.user=root
druid.metadata.storage.connector.password=hadoop

druid.pusher.hdfs=true
druid.pusher.hdfs.storageDirectory=hdfs://sandbox.hortonworks.com:8020/druid

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://sandbox.hortonworks.com:8020/druid

druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight","io.druid.extensions:druid-hdfs-storage","io.druid.extensions:mysql-metadata-storage"]


Spec file: wikipedia_realtime.spec

[
   {
       "dataSchema": {
           "dataSource": "wikipedia",
           "parser": {
               "type": "string",
               "parseSpec": {
                   "format": "json",
                   "timestampSpec": {
                       "column": "timestamp",
                       "format": "auto"
                   },
                   "dimensionsSpec" : {
          "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
          "dimensionExclusions" : [],
          "spatialDimensions" : []
        }
      }
    },
           "metricsSpec": [
               {
                   "type": "count",
                   "name": "count"
               }
           ],
           "granularitySpec": {
               "type": "uniform",
               "segmentGranularity": "DAY",
               "queryGranularity": "NONE"
           }
       },
       "ioConfig": {
           "type": "realtime",
           "firehose": {
               "type": "kafka-0.8",
               "consumerProps": {
                   "zookeeper.connect": "localhost:2181",
                   "zookeeper.connection.timeout.ms": "15000",
                   "zookeeper.session.timeout.ms": "15000",
                   "zookeeper.sync.time.ms": "5000",
                   "group.id": "wikipedia",
                   "fetch.message.max.bytes": "1048586",
                   "auto.offset.reset": "largest",
                   "auto.commit.enable": "false"
               },
               "feed": "wikipedia"
           },
           "plumber": {
               "type": "realtime"
           },
      "children": [
          {
        "type" : "dataSource",
        "ingestionSpec" : {
            "dataSource": "wikipedia"
        }
         },
          {
        "type" : "hadoop",
        "paths": "hdfs://sandbox.hortonworks.com:8020/druid/wikipedia_data.json"
      }
      ],
    "metadataUpdateSpec" : {
      "type":"mysql",
      "connectURI" : "jdbc:mysql://127.0.0.1:3306/druid",
      "password" : "hadoop",
      "segmentTable" : "druid_segments",
      "user" : "root"
    },
    "segmentOutputPath" : "hdfs://sandbox.hortonworks.com:8020/druid/outputSegment"
       },
       "tuningConfig": {
           "type": "realtime",
           "maxRowsInMemory": 500000,
           "intermediatePersistPeriod": "PT10m",
           "windowPeriod": "PT10m",
           "basePersistDirectory": "baseDir",
           "rejectionPolicy": {
               "type": "messageTime"
           }
       }
   }
]


I am using the HDP sandbox, which contains Hadoop 2.6.0. I have included the Hadoop configs in the classpath when running:

java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8              -Ddruid.realtime.specFile=examples/wikipedia/wikipedia_realtime.spec      -classpath "config/_common:config/realtime:lib/*:/usr/hdp/2.2.4.2-2/hadoop/conf:/usr/hdp/2.2.4.2-2/hadoop/lib/*:/usr/hdp/2.2.4.2-2/hadoop/.//*:/usr/hdp/2.2.4.2-2/hadoop-hdfs/./:/usr/hdp/2.2.4.2-2/hadoop-hdfs/lib/*:/usr/hdp/2.2.4.2-2/hadoop-hdfs/.//*:/usr/hdp/2.2.4.2-2/hadoop-yarn/lib/*:/usr/hdp/2.2.4.2-2/hadoop-yarn/.//*:/usr/hdp/2.2.4.2-2/hadoop-mapreduce/lib/*:/usr/hdp/2.2.4.2-2/hadoop-mapreduce/.//*::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-mapreduce-client/*:/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*:/etc/tez/conf/:/usr/hdp/2.2.4.2-2/tez/*:/usr/hdp/2.2.4.2-2/tez/lib/*:/etc/tez/conf"               io.druid.cli.Main server realtime

Druid version: 0.7.3

Incidentally, no table is created in the MySQL druid database. Is there anything I missed?

Thanks.

Sai Subramaniam

Sep 1, 2015, 4:27:04 AM
to Druid Development
Hi,

After reading FJ's post: real-time nodes periodically create segments at the end of the window period plus the segment granularity period, and these segments contain the data ingested during that period. In my case, with segmentGranularity DAY and windowPeriod PT10m, handoff should happen roughly ten minutes after the end of each day.
I was able to persist to the local disk, but I encounter this error when I try to save to HDFS:

Exception in thread "plumber_merge_7" java.lang.NoSuchMethodError: org.apache.hadoop.ipc.RPC.getProtocolProxy(Ljava/lang/Class;JLjava/net/InetSocketAddress;Lorg/apache/hadoop/security/UserGroupInformation;Lorg/apache/hadoop/conf/Configuration;Ljavax/net/SocketFactory;ILorg/apache/hadoop/io/retry/RetryPolicy;Ljava/util/concurrent/atomic/AtomicBoolean;)Lorg/apache/hadoop/ipc/ProtocolProxy;
    at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:420)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:316)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:73)
    at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:456)
    at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Please help in figuring out the solution. Thank you!