standard hbase client, asynchbase client, netty and direct memory buffers


Jonathan Payne

unread,
Oct 21, 2011, 9:30:40 PM10/21/11
to AsyncHBase
I thought I'd take a moment to explain what I discovered trying to track down serious problems with the regular (non-async) hbase client and Java's nio implementation.

We were having issues running out of direct memory and here's a stack trace which says it all:

        java.nio.Buffer.<init>(Buffer.java:172)
        java.nio.ByteBuffer.<init>(ByteBuffer.java:259)
        java.nio.ByteBuffer.<init>(ByteBuffer.java:267)
        java.nio.MappedByteBuffer.<init>(MappedByteBuffer.java:64)
        java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:97)
        java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
        sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:155)
        sun.nio.ch.IOUtil.write(IOUtil.java:37)
        sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        java.io.DataOutputStream.flush(DataOutputStream.java:106)
        org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:518)
        org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:751)
        org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        $Proxy11.getProtocolVersion(<Unknown Source>:Unknown line)

Here you can see that an HBaseClient request is flushing a stream which has a socket channel at the other end of it. HBase decided not to use direct memory for its byte buffers, which I thought was smart since direct buffers are difficult to manage. Unfortunately, behind the scenes the JDK notices that the buffer passed to the socket channel write is not direct, and it allocates a direct memory buffer on your behalf! The size of that direct buffer depends on the amount of data you want to write at the time, so if you are writing 1M of data, the JDK will allocate 1M of direct memory.
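You can watch this hidden allocation happen with the platform MXBeans. The sketch below is illustrative only (class name and file-channel setup are mine, not HBase's), and it assumes a HotSpot JDK where the temporary-buffer cache is unbounded by default:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class DirectBufferProbe {
    public static void main(String[] args) throws Exception {
        // Find the JVM's "direct" buffer pool so we can measure its usage.
        BufferPoolMXBean direct = null;
        for (BufferPoolMXBean b : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(b.getName())) direct = b;
        }

        long before = direct.getMemoryUsed();

        File tmp = File.createTempFile("probe", ".bin");
        tmp.deleteOnExit();
        try (FileChannel ch = new RandomAccessFile(tmp, "rw").getChannel()) {
            // A 1M *heap* buffer -- we never ask for direct memory ourselves.
            ByteBuffer heap = ByteBuffer.allocate(1 << 20);
            ch.write(heap); // the JDK copies it into a temporary direct buffer first
        }

        long grown = direct.getMemoryUsed() - before;
        System.out.println(grown >= (1 << 20)
                ? "hidden direct allocation of at least 1M observed"
                : "no hidden direct allocation observed");
    }
}
```

The temporary buffer is cached per-thread rather than freed, which is why the pool usage stays elevated after the write returns.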

The same is done on the reading side. If you perform channel I/O with a non-direct buffer, the JDK allocates a direct buffer for you, and in the reading case it sizes that buffer to match the amount of room remaining in the (heap) buffer you passed in to the read call. WTF!? That can be a very large value.

To make matters worse, the JDK caches these direct memory buffers in thread local storage and it caches not one, but three of these arbitrarily sized buffers. (Look in sun.nio.ch.Util.getTemporaryDirectBuffer and let me know if I have interpreted the code incorrectly.) So if you have a large number of threads talking to hbase you can find yourself overflowing with direct memory buffers that you have not allocated and didn't even know about.
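Because those cached buffers are sized to match whatever you happened to read or write, one defensive pattern is to do the copy yourself through a single bounded, reusable direct buffer, so the JDK never sees a heap buffer at the channel boundary. A minimal sketch (the helper class is hypothetical, not part of any library):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class BoundedDirectWriter {
    private static final int CHUNK = 64 * 1024; // a fixed, manageable size
    private final ByteBuffer direct = ByteBuffer.allocateDirect(CHUNK);

    void write(WritableByteChannel ch, ByteBuffer heapSrc) throws IOException {
        while (heapSrc.hasRemaining()) {
            direct.clear();
            // Copy at most CHUNK bytes out of the heap buffer...
            int n = Math.min(CHUNK, heapSrc.remaining());
            ByteBuffer slice = heapSrc.duplicate();
            slice.limit(slice.position() + n);
            direct.put(slice);
            heapSrc.position(heapSrc.position() + n);
            direct.flip();
            // ...and hand the channel a buffer that is already direct, so the
            // JDK has no reason to allocate a temporary one behind our back.
            while (direct.hasRemaining()) {
                ch.write(direct);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("bounded", ".bin");
        tmp.deleteOnExit();
        try (FileChannel ch = new RandomAccessFile(tmp, "rw").getChannel()) {
            new BoundedDirectWriter().write(ch, ByteBuffer.allocate(300_000));
        }
        System.out.println("wrote " + tmp.length() + " bytes");
    }
}
```

The trade-off is an extra copy per chunk, but the direct-memory footprint stays bounded at CHUNK per writer no matter how large the requests get.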

This issue is what caused us to check out the asynchbase client, which happily didn't have any of these problems. The reason is that asynchbase uses netty and netty knows the proper way of using direct memory buffers for I/O. The correct way is to use direct memory buffers in manageable sizes, 16k to 64k or something like that, for the purpose of invoking a read or write system call. Netty has algorithms for calculating the best size given a particular socket connection, based on the amount of data it seems to be able to read at once, etc. Netty reads the data from the OS using direct memory and copies that data into Java byte buffers.
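Netty's sizing logic can be caricatured in a few lines: grow the guess when a read fills the buffer, shrink it when reads come back mostly empty. This toy is only an illustration of the idea; netty's actual adaptive receive-buffer predictor is more refined:

```java
/**
 * Toy illustration of netty-style adaptive read-buffer sizing
 * (not netty's real implementation).
 */
class AdaptiveReadSize {
    private static final int MIN = 16 * 1024, MAX = 64 * 1024;
    private int guess = MIN; // start small

    int nextSize() { return guess; }

    void record(int bytesRead) {
        if (bytesRead >= guess) {
            guess = Math.min(MAX, guess * 2);   // buffer was filled: try bigger
        } else if (bytesRead < guess / 2) {
            guess = Math.max(MIN, guess / 2);   // mostly empty: shrink back
        }
    }

    public static void main(String[] args) {
        AdaptiveReadSize a = new AdaptiveReadSize();
        a.record(16 * 1024);              // a full 16k read -> grow
        System.out.println(a.nextSize());
        a.record(1024);                   // a tiny read -> shrink
        System.out.println(a.nextSize());
    }
}
```

Keeping the guess between 16k and 64k is what keeps the direct buffers "manageable" in the sense above: big enough for efficient syscalls, small enough that a thread's cached buffer never matters.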

Now you might be wondering why you don't just pass a regular Java byte array into the read/write calls, to avoid the copy from direct memory to java heap memory, and here's the story about that. Let's assume you're doing a file or socket read. There are two cases:

  • If the amount being read is 8k or less, it uses a char array on the C stack for the read system call, and then copies the result into your Java buffer.
  • If the amount being read is more than 8k, the JDK calls malloc, does the read system call with that buffer, copies the result into your Java buffer, and then calls free.

The reason for this is that the compacting Java garbage collector might move your Java buffer while you're blocked in the read system call, and clearly that will not do. But if you are not aware of the malloc/free being called on every read larger than 8k, you might be surprised by the performance. Direct memory buffers were created to avoid that malloc/free on every read: you still need to copy, but you don't need to malloc/free each time.

People get into trouble with direct memory buffers because you cannot explicitly free one when you know you are done with it. Instead you have to wait for the garbage collector to run and THEN for the finalizers to be executed. You can never tell when the GC will run or when your finalizers will be executed, so it's really a crapshoot. That's why the JDK caches these buffers (which it shouldn't be creating in the first place). And the larger your heap size, the less frequent the GCs.

Worse, I saw code in the JDK which calls System.gc() manually when a direct memory buffer allocation fails, which is an absolute no-no. That might work with small heap sizes, but with large heap sizes a full System.gc() can take 15 or 20 seconds. We were trying to use the G1 collector, which allows for very large heap sizes without long GC delays, but those delays were occurring anyway because some piece of code was manually running GC. When I disabled System.gc() with a command-line option, we ran out of direct memory instead.

This is long but I hope informative.

Stack

unread,
Oct 22, 2011, 12:05:06 AM10/22/11
to Jonathan Payne, AsyncHBase
This is great stuff Jonathan. Mind if I forward to the hbase dev list
(or probably better if you do it -- if you don't mind).

Thanks for taking the time to write it up.

St.Ack
