Did you actually test anything 2.0 specific?
Yes, that's understandable. Right now you'd have to do the
pre-aggregation either separately (e.g. as a process that reads the
fine-grained data and stores back a 5-minute average under a separate
metric name) or within your collector (which I think is simpler).
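For the in-collector option, here's a rough sketch of what I mean. The
emit() helper and the ".5m_avg" naming are just placeholders I made up,
and a real collector would also need to decide how to handle flushing
on shutdown:

import java.util.HashMap;
import java.util.Map;

final class FiveMinuteAverager {
  private static final long WINDOW_MS = 5 * 60 * 1000L;
  // "metric tags" -> {sum, count} for the current 5-minute window.
  private final Map<String, double[]> acc = new HashMap<String, double[]>();
  private long windowStart = align(System.currentTimeMillis());

  private static long align(final long ts) {
    return ts - (ts % WINDOW_MS);
  }

  /** Records one fine-grained sample; flushes when the window rolls over. */
  void record(final String metric, final String tags,
              final long timestampMs, final double value) {
    if (align(timestampMs) != windowStart) {
      flush();
      windowStart = align(timestampMs);
    }
    final String key = metric + ' ' + tags;
    double[] a = acc.get(key);
    if (a == null) {
      a = new double[2];
      acc.put(key, a);
    }
    a[0] += value;  // sum
    a[1]++;         // count
  }

  /** Writes each average back under a separate, coarser metric name. */
  private void flush() {
    for (final Map.Entry<String, double[]> e : acc.entrySet()) {
      final String[] parts = e.getKey().split(" ", 2);
      final double avg = e.getValue()[0] / e.getValue()[1];
      emit("put " + parts[0] + ".5m_avg " + (windowStart / 1000)
           + " " + avg + " " + parts[1]);
    }
    acc.clear();
  }

  private void emit(final String line) {
    System.out.println(line);  // placeholder: write to the TSD socket instead
  }
}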
Can you please outline which symptoms you observed, other than
the single thread at 100% that you mentioned below?
> parallel workers to create my load, and sometimes TSD gets into a mode where
> it just single threads at 100% for a period before running again in parallel
> threads.
This is not supposed to happen. Sounds like a possible GC issue.
Have you looked at the GC log? Were you somehow running TSD out of
memory?
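(If TSD isn't already writing a GC log, the usual HotSpot flags for
that are along the lines of -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/var/log/opentsdb/tsd-gc.log, where the
path is just an example. Long pauses or back-to-back full GCs in that
log would point at a memory problem.)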
I'm not sure why the region size would cause TSD to go mental and
get more or less stuck in a 100% CPU loop. It also seems
counterintuitive that you got better results with smaller region
sizes; that's the opposite of what we generally observe.
If you want to do high-throughput write tests with any HBase
application, you first need to pre-split a few regions. Have you done
that, or did you always start from an empty table?
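For reference, here's a rough sketch of one way to create a pre-split
tsdb table with the plain HBase client API. The region count and split
points are illustrative only; they assume metric UIDs are assigned
sequentially from 0 (the default), so adjust them to cover the metrics
your test actually writes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public final class PreSplitTsdb {
  public static void main(final String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    final HBaseAdmin admin = new HBaseAdmin(conf);
    final HTableDescriptor desc = new HTableDescriptor("tsdb");
    desc.addFamily(new HColumnDescriptor("t"));  // OpenTSDB's data column family
    final int regions = 16;                      // illustrative region count
    final byte[][] splits = new byte[regions - 1][];
    for (int i = 1; i < regions; i++) {
      // Row keys start with the 3-byte metric UID; split on its low byte.
      splits[i - 1] = new byte[] { 0, 0, (byte) (i * 256 / regions) };
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}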
> The other scenario was when we had to restart
> HBase, and TSD could not re-establish a connection.
What was happening then?
I can send you the table creation script I use to pre-split. I should be able to do that in an hour or so.
I was mostly interested in the api/query endpoint for the JSON response and the decoupling from gnuplot. We also plan on using the metadata features, and have tracking enabled. Now that I think of it, I read somewhere that the current metadata code has locking, and perhaps that is contributing to my performance problems.
We already have concerns over the limited metric namespace, so using it to capture the different levels of aggregation will put additional pressure on that front. We also want to keep the collector lean to limit the overhead on the monitored system. Regardless, as things stand we'd have to invest in some R&D to add this feature, so it will factor into our final decision.
I see a lot of these, but it seems they might be normal:
2013-05-04 01:02:31,057 INFO [New I/O worker #51] HBaseClient: There are now 9000 RPCs pending
I'm not sure what message you're referring to when you're saying "it
could not retrieve logs". Was this an HBase message?
The messages that say "WTF" are surely from asynchbase or OpenTSDB.
They indicate something really unexpected happened, and most of them
should be treated as bugs. If you can provide more details
surrounding each WTF message, that would be helpful.
> I attached jconsole to TSD but didn't see anything with GC. Could it be the
> metadata locks?
No, I can't really see how any sort of lock could cause TSD to end up
in a state where a single thread is spinning. If it wasn't GC, then
what was it? Have you captured a few stack traces (with jstack -l)
while the problem was occurring? It could help us understand what's
going on there.
This one comes from asynchbase and seems to indicate that HBase
unexpectedly closed a connection from TSD. This is not supposed to
happen, unless maybe your HBase RegionServers are crashing. Was there
any crash or anything unusual on the HBase side?
> This type of message gets repeated many times in my log files:
>
> 2013-05-04 12:49:36,034 ERROR [New I/O worker #52] HBaseClient: WTF? Trying
> to add AtomicIncrementRequest(table="tsdb-uid", key=[0, 1, 39, 0, 0, 1, 0,
> 0, 1, 0, 0, 2, 0, 0, 11, 0, 0, 3, 0, 0, 59], family="name", qualifier="ts_ct
> r", amount=1, attempt=5, region=RegionInfo(table="tsdb-uid",
> region_name="tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0.",
> stop_key=[0, 4, -33, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 28, 0, 0, 3, 0, 0,
> 14])) twice to NSREd
> RPC on "tsdb-uid,,1367626896985.7e0e16f85dd9310451d3b155acb665d0."
This is not supposed to happen. Chris, can you think of a case where
we potentially send the same AtomicIncrementRequest twice? The
HBaseRpc objects can be re-used, but they cannot be used multiple
times concurrently. In other words, if you want to issue two
identical AtomicIncrementRequests at the same time, you cannot just
hand the same RPC twice to asynchbase. I'm not sure the code protects
itself against such usage, but I imagine it's one way the "WTF"
message above could occur.
final AtomicIncrementRequest inc = new AtomicIncrementRequest(
    tsdb.uidTable(), tsuid, FAMILY, COUNTER_QUALIFIER);
tsdb.getClient().bufferAtomicIncrement(inc).addCallback(
    new TSMetaCB(tsdb, tsuid));
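To make that constraint concrete, here's a small sketch of the
distinction. It's not from the OpenTSDB code; the imports are from
asynchbase, and in real code you'd also chain error callbacks on the
returned Deferreds:

import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.HBaseClient;

final class IncrementExample {
  static void increment(final HBaseClient client, final byte[] tsuid) {
    final byte[] table = "tsdb-uid".getBytes();
    final byte[] family = "name".getBytes();
    final byte[] qualifier = "ts_ctr".getBytes();

    // Risky: handing the same RPC object to asynchbase a second time
    // while the first call may still be in flight.
    final AtomicIncrementRequest shared =
        new AtomicIncrementRequest(table, tsuid, family, qualifier);
    client.bufferAtomicIncrement(shared);
    client.bufferAtomicIncrement(shared);

    // Safe: build a fresh AtomicIncrementRequest for each increment.
    client.bufferAtomicIncrement(
        new AtomicIncrementRequest(table, tsuid, family, qualifier));
    client.bufferAtomicIncrement(
        new AtomicIncrementRequest(table, tsuid, family, qualifier));
  }
}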
My test loads ran for about 15 hours this time before falling over. I've attached my TSD log file with all the INFO records (quite chatty with all the network activity logging) filtered out. It looks like the system might have gotten busy with the initial set of errors, then snowballed until it fell over with the WTF messages. I don't think this is a network problem, as we were running smoothly until things fell apart.