On Mon, Jun 11, 2012 at 10:31 AM, Nick Telford <
nick.t...@gmail.com> wrote:
> While we love the high-throughput and non-blocking nature of asynchbase,
> we're concerned about the scenario where clients attempt to write with a
> higher throughput than the cluster can handle, usually transiently, e.g.
> during major compactions, heavy GC-load, moving region servers etc.
>
> What's the current situation, does the client buffer writes until they
> complete? Will it continue buffering until OOM if it can't complete them
> fast enough?
The only really problematic case is when a region is moving (due to
balancing or RS failure) or is being split. There is another
interesting problematic case that can happen if you try to suddenly
send a very large number of RPCs, each to a different key of a given
table, on a freshly created HBaseClient.
Let me first explain briefly the life of an RPC:
When you give an RPC to HBaseClient, it first looks up in its META
cache the region where it thinks this RPC ought to go. In a long
running application, chances are that the client already knows all the
regions in its working set, and already has an open connection the
right RegionServer. In that case the RPC goes straight to the right
RegionClient, which serializes it and sends it out to the network. At
this point, only 2 things are left in memory: the actual object of the
RPC being sent, along with an entry in some map that keeps track of
RPC IDs assigned associated with outstanding ("in-flight") RPCs.
There is currently no limit on the number of RPCs that can be
outstanding at the same time. If you look at RegionClient.java,
you'll find this:
// TODO(tsuna): Check the size() of rpcs_inflight. We don't want to have
// more than M RPCs in flight at the same time, and we may be overwhelming
// the server if we do.
What this means is that yes it is currently possible that your
application would run itself out of memory if it keeps producing and
sending new RPCs faster than the server can handle. As the TODO shows
above, there is currently no mechanism in asynchbase to prevent this
from happening. It's not hard to put a limit on the number of RPCs in
flight, what's hard and the reason why I haven't done this yet is:
what do you do once you hit that limit? The premise of asynchbase is
to never block, so how do we relay some back pressure to the caller to
tell them to slow down?
In the event that the META cache didn't know where the key was, a META
lookup needs to occur. This can also be problematic. To see how,
create a new HBaseClient, and have a tight loop create 1000000 RPCs,
each for a different key, and send them all to the HBaseClient. The
client will think it needs to do one META lookup per RPC, which would
be really bad and cause what I call a "META storm", as all of a sudden
a bazillion lookups hit the META region. In order to try to mitigate
this, and because META lookups are generally fast, there is a
rate-limit mechanism which is explained in the javadoc of the
acquireMetaLookupPermit method in RegionClient.java. The idea is that
there is a semaphore that controls the number of META lookups that can
happen concurrently, and when the semaphore runs out of permits,
subsequent META lookups get a 5ms penalty. In this case I "cheated"
because I'm actually blocking, albeit only for 5ms. I found this
strategy to be rather effective as it significantly reduces the
magnitude of "META storms" when they happen.
In my experience however, the concerns above, although they're real,
don't tend to cause big practical production issues. The problems
tend to come from region unavailability due to splits / balancing /
failures.
What happens then is that the RegionServer will return a
NoSuchRegionException (during balancing / splits), or the connection
to the RegionServer will fail if the RegionServer is dead (the best
case is because the JVM of the RS is dead, so the OS will just reset
all the connections, the worst case is if the JVM isn't dead or the
machine is dead, in which case you have to wait until the TCP
connection keep-alive times out). Once a region is identified as
being unavailable, it is marked as such in HBaseClient's META cache,
so that all subsequent RPCs to this region don't even go to the
network. Instead, they get diverted into a queue where they wait
until the region comes back online. This queue has a hard-coded limit
of 10k RPCs per region. There is actually a low watermark (hardcoded
at 1k RPCs) and a high watermark (hardcoded at 10k RPCs). When the
low watermark is hit, the 1000th RPCs to try to enter the queue will
immediately fail with a PleaseThrottleException, which you can handle
by adding an errback to the Deferred of your RPCs. Then subsequent
RPCs are also queued until the high watermark is hit, after which all
subsequent RPCs will fail-fast with a PleaseThrottleException.
The idea behind PleaseThrottleException is that this is how asynchbase
is trying to apply non-blocking back pressure on the caller when
things are backing up too much. When you get it, you can handle it
however you want. Typically what I would recommend you do is that you
turn a boolean to mark the server as unhealthy, so that it stops
receiving traffic.
I hope this helped shed some light on the current state of things.
There is obviously a lot of room for improvement, and I'm all hear if
you have cool ideas on how to properly apply back pressure in a
non-blocking fashion in various scenarios above, while mitigating the
risk of running ourselves out of memory.
--
Benoit "tsuna" Sigoure
Software Engineer @
www.StumbleUpon.com