Titan 0.5.0 + Hbase : Failing Region Servers


Guy Taylor

Aug 21, 2014, 6:06:57 AM
to aureliu...@googlegroups.com
Hi all,

I recently set up Titan 0.5.0 with HBase. I got my imports working nicely, but now when I try to query it:

g = HadoopFactory.open('/home/titan/conf/hadoop/titan-hbase-input-output.properties')

and then whether I do:

g.E.type.groupCount or g.V.type.groupCount or v = g.v(4)

The job gets to 17% map / 6% reduce, and then my region servers start failing silently.

HBase itself is fine; I can do standard lookups against it. My titan-hbase-input-output.properties looks like this:

# input graph parameters
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseInputFormat
titan.hadoop.input.conf.storage.backend=hbase
titan.hadoop.input.conf.storage.hostname=master.tyme-data.com,nn1.tyme-data.com,nn2.tyme-data.com
titan.hadoop.input.conf.storage.port=2181
titan.hadoop.input.conf.storage.hbase.table=titan
# hbase.mapreduce.scan.cachedrows=1000

titan.hadoop.pipeline.track-state=true

# output data (graph or statistic) parameters
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseOutputFormat
titan.hadoop.output.conf.storage.backend=hbase
titan.hadoop.output.conf.storage.hostname=master.tyme-data.com,nn1.tyme-data.com,nn2.tyme-data.com
titan.hadoop.output.conf.storage.port=2181
titan.hadoop.output.conf.storage.hbase.table=titan
titan.hadoop.output.conf.storage.batch-loading=true
titan.hadoop.output.infer-schema=true
# controls size of transaction
mapred.max.split.size=5242880
# mapred.reduce.tasks=10
mapred.job.reuse.jvm.num.tasks=-1
mapred.linerecordreader.maxlength=5242880
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
mapred.map.tasks=6
mapred.reduce.tasks=3
mapred.job.reuse.jvm.num.tasks=-1
mapred.task.timeout=5400000
mapred.reduce.parallel.copies=50
io.sort.factor=100
io.sort.mb=200

titan.hadoop.sideeffect.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
root.storage.hbase.ext.hbase.zookeeper.property.clientPort = 2181

I’ve been trying to debug this for a couple of days now, and I’m starting to get a bit unstuck. Any help, or pointers in the right direction, would be appreciated.

--

Guy Taylor

Aug 21, 2014, 9:46:39 AM
to aureliu...@googlegroups.com
Following this,

My first stack trace looks like this:

Error: java.lang.IllegalArgumentException: Could not instantiate implementation: com.thinkaurelius.titan.hadoop.formats.util.input.current.TitanHadoopSetupImpl
at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:55)
at com.thinkaurelius.titan.hadoop.formats.util.TitanInputFormat.setConf(TitanInputFormat.java:44)
at com.thinkaurelius.titan.hadoop.formats.hbase.TitanHBaseInputFormat.setConf(TitanHBaseInputFormat.java:49)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:44)
... 11 more
Caused by: java.lang.IllegalArgumentException: Could not instantiate implementation: com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager
at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:55)
at com.thinkaurelius.titan.diskstorage.Backend.getImplementationClass(Backend.java:425)
at com.thinkaurelius.titan.diskstorage.Backend.getStorageManager(Backend.java:366)
at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.<init>(GraphDatabaseConfiguration.java:1208)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:92)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:81)
at com.thinkaurelius.titan.hadoop.formats.util.input.current.TitanHadoopSetupImpl.<init>(TitanHadoopSetupImpl.java:37)
... 16 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:44)
... 22 more
Caused by: com.thinkaurelius.titan.diskstorage.PermanentBackendException: Permanent failure in storage backend
at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.<init>(HBaseStoreManager.java:327)
... 27 more
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:421)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnectionInternal(ConnectionManager.java:314)
at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:291)
at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.<init>(HBaseStoreManager.java:323)
... 27 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:419)
... 30 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.protobuf.generated.ClientProtos$Result$Builder.setStale(Z)Lorg/apache/hadoop/hbase/protobuf/generated/ClientProtos$Result$Builder;
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.<clinit>(ProtobufUtil.java:192)
at org.apache.hadoop.hbase.ClusterId.parseFrom(ClusterId.java:64)
at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:75)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:86)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:853)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:657)
... 35 more

Dan LaRocque

Aug 24, 2014, 9:52:00 PM
to aureliu...@googlegroups.com
Hi Guy,

Looks like Hadoop 1.x, judging from the config file. Please correct me if that's
not the case.

Regarding the trace: From the limited information here, it looks like a
possible mismatch between hbase jar versions. Under 0.96 & 0.98,
hbase-protocol contains the protobuf generated classes, such as
ClientProtos$Result$Builder, while the separate hbase-client jar
contains ProtobufUtil. The linkage failure in the trace happens right
on that dividing line. It was a little different in 0.94, where I think
everything was packed into a single jar called hbase. I think
ClientProtos$Result$Builder also didn't exist back in 0.94. It looks
almost like ProtobufUtil from hbase-client 0.98/0.96 is trying to call
into older generated protobuf classes from 0.94, or some variation on
that idea. Not certain, just my best guess at the moment. Can you show
the complete classpath used by these MR task JVMs?
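If it helps, here's a quick sketch for eyeballing that. The helper name is mine, and it assumes the `hadoop` CLI is on the task nodes' PATH:

```shell
#!/bin/sh
# hbase_jars_in: split a colon-separated classpath string and print the
# HBase-related jars, deduplicated, so mixed versions (e.g. a 0.94 jar
# sitting next to a 0.98 jar) stand out at a glance.
hbase_jars_in() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -i hbase | sort -u
}

# Against a live cluster (sketch):
#   hbase_jars_in "$(hadoop classpath)"
```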

Jumping back to your previous email for a moment: I'm not sure what you
mean by regionservers failing silently. When the job runs, are you
saying that HBase's regionserver processes start dying without leaving
anything interesting in the regionserver .log or .out file?
Faunus/Titan mapreduce tasks write to HBase through the ordinary client
API. Titan shouldn't be crashing regionservers unless something's
seriously wrong.

thanks,
Dan

Guy Taylor

Aug 25, 2014, 10:03:50 AM
to aureliu...@googlegroups.com
Hi Dan,

Thanks for getting back to me. You're absolutely correct, and I've corrected these configs.

I did, in fact, locate the jar mismatch on Friday and resolved that issue. Thanks for the pointer.

On the last note: yes, I'm saying exactly that. The dataset is about 8GB (34 million vertices), and what tends to happen is the following.

I run:

gremlin> g = HadoopFactory.open('/home/titan/conf/hadoop/titan-hbase-input-output.properties')
==>titangraph[hadoop:titanhbaseinputformat->graphsonoutputformat]
gremlin> g._()

Which starts off the MR jobs. These jobs run for a couple of minutes before a region server dies, with the following information in the regionserver's log file:

2014-08-25 15:44:02,919 INFO [RpcServer.handler=40,port=60020] regionserver.HRegionServer: Client tried to access missing scanner 7706071149308167298
2014-08-25 15:44:03,635 INFO [RpcServer.handler=44,port=60020] compress.CodecPool: Got brand-new decompressor [.gz]
2014-08-25 15:44:30,736 WARN [RpcServer.handler=44,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":27742,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.0.30.5:56001","starttimems":1408974242948,"queuetimems":0,"class":"HRegionServer","responsesize":247,"method":"Scan"}

Which is followed by this exception in gremlin:

Error: org.apache.hadoop.hbase.client.ScannerTimeoutException: 74416ms passed since the last invocation, timeout is currently set to 60000

Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.UnknownScannerException): org.apache.hadoop.hbase.UnknownScannerException: Name: 3819231969995942960, already closed?
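For what it's worth, my current guess (unverified) is that the scan is just too slow between next() calls, so the lease expires. These are the HBase client knobs I'm planning to try; I'm not certain whether the right prefix is `titan.hadoop.input.conf.storage.hbase.ext.` or the `root.storage.hbase.ext.` form I used in my config, and the values are guesses:

```properties
# Sketch, not verified: raise the client scanner timeout and shrink the
# per-RPC batch so each next() returns well inside the lease period.
# hbase.client.scanner.timeout.period is the 0.96+ name; older releases
# used hbase.regionserver.lease.period (set on the regionservers).
titan.hadoop.input.conf.storage.hbase.ext.hbase.client.scanner.timeout.period=120000
titan.hadoop.input.conf.storage.hbase.ext.hbase.client.scanner.caching=100
```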

I'm unfortunately a bit new to Hadoop as well as Titan, and between MapReduce, HBase, and Titan there are a lot of moving parts to understand. I'd be happy to provide more information; I'm just not sure what to supply.

Also, in order to do queries will I have to use Faunus or will I be able to query from Titan’s gremlin.sh?

Dan LaRocque

Sep 1, 2014, 1:56:24 AM
to aureliu...@googlegroups.com
Hi Guy,

On 08/25/2014 10:03 AM, Guy Taylor wrote:
> Hi Dan,
>
> Thanks for getting back to me. You're absolutely correct, and I've corrected this configs.
>
> I did, in fact, locate the jar mismatches on Friday and resolved that issue. Thank you for the help, I managed to find that.
>
> On the last note. Yes, I'm saying exactly that. The dataset it about 8GB (34 million vertices), and what tends to happen is the following.
>
> I run:
>
> gremlin> g = HadoopFactory.open('/home/titan/conf/hadoop/titan-hbase-input-output.properties')
> ==>titangraph[hadoop:titanhbaseinputformat->graphsonoutputformat]
> gremlin> g._()
>
> Which starts off the MR jobs. These jobs run for a couple of minutes when region server dies, with the following information in the region-manager's log file:
>
> 2014-08-25 15:44:02,919 INFO [RpcServer.handler=40,port=60020] regionserver.HRegionServer: Client tried to access missing scanner 7706071149308167298
> 2014-08-25 15:44:03,635 INFO [RpcServer.handler=44,port=60020] compress.CodecPool: Got brand-new decompressor [.gz]
> 2014-08-25 15:44:30,736 WARN [RpcServer.handler=44,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":27742,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.0.30.5:56001","starttimems":1408974242948,"queuetimems":0,"class":"HRegionServer","responsesize":247,"method":"Scan"}

I'm not familiar with those particular HBase log messages, but they
don't look fatal. Have you checked to see whether the OS kernel is
terminating these regionserver processes due to resource constraints,
such as memory pressure? Not sure what platform you're on, but the
Linux kernel OOM killer prints a diagnostic to the kernel ring buffer
(dmesg and sometimes also a file in /var/log/) every time it kills a
process.
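A quick sketch for that check (assuming Linux hosts; the helper name is mine, and the exact kernel message wording varies by version, hence the loose pattern):

```shell
#!/bin/sh
# oom_lines: filter kernel-log text on stdin for OOM-killer messages.
# "|| true" keeps a clean exit even when nothing matches.
oom_lines() {
  grep -iE 'out of memory|killed process' || true
}

# Run on each regionserver host (sketch):
#   dmesg | oom_lines
#   oom_lines < /var/log/kern.log
```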

> Which is followed by this exception in gremlin:
>
> Error: org.apache.hadoop.hbase.client.ScannerTimeoutException: 74416ms passed since the last invocation, timeout is currently set to 60000
> ...
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.UnknownScannerException): org.apache.hadoop.hbase.UnknownScannerException: Name: 3819231969995942960, already closed?
>
> I'm unfortunately a bit new to Hadoop as well as Titan, and between MapReduce/Hbase/Titan it's a lot of moving parts to understand. I'd be happy to provide more information, I'm not sure what information to supply however.
>
> Also, in order to do queries will I have to use Faunus or will I be able to query from Titan's gremlin.sh?

I don't think I understand this last question. In 0.5.0 you can use
HadoopFactory.open and TitanFactory.open from the same gremlin.sh REPL
in bin. Both interfaces can read/write the same underlying graph.

thanks,
Dan

Guy Taylor

Sep 3, 2014, 10:51:30 AM
to aureliu...@googlegroups.com
Hi Dan,

On 01 Sep 2014, at 7:56 AM, Dan LaRocque <d...@thinkaurelius.com> wrote:

>> 2014-08-25 15:44:02,919 INFO [RpcServer.handler=40,port=60020] regionserver.HRegionServer: Client tried to access missing scanner 7706071149308167298
>> 2014-08-25 15:44:03,635 INFO [RpcServer.handler=44,port=60020] compress.CodecPool: Got brand-new decompressor [.gz]
>> 2014-08-25 15:44:30,736 WARN [RpcServer.handler=44,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":27742,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.0.30.5:56001","starttimems":1408974242948,"queuetimems":0,"class":"HRegionServer","responsesize":247,"method":"Scan"}
>
> I'm not familiar with those particular HBase log messages, but they don't look fatal. Have you checked to see whether the OS kernel is terminating these regionserver processes due to resource constraints, such as memory pressure? Not sure what platform you're on, but the Linux kernel OOM killer prints a diagnostic to the kernel ring buffer (dmesg and sometimes also a file in /var/log/) every time it kills a process.

Yeah, I can't find anything in particular there. I tried a standard Pig query against the 'titan' table in HBase and saw a similar fallover; I then tried it against the table that was used as the source for the import, and that was fine.

I switched over to Cassandra this afternoon to try it. All has been fine since. I need to think a bit about a new architecture, though.


>>
>> Also, in order to do queries will I have to use Faunus or will I be able to query from Titan's gremlin.sh?
>
> I don't think I understand this last question. In 0.5.0 you can use HadoopFactory.open and TitanFactory.open from the same gremlin.sh REPL in bin. Both interfaces can read/write the same underlying graph.

You answered it perfectly, my thanks.

Thank you for all the help.
Guy