Writing to underlying HBase cluster


Bryan H

Apr 10, 2015, 7:28:35 PM4/10/15
to open...@googlegroups.com
In the OpenTSDB documentation it says: "Don't try to write directly to the underlying storage system, e.g. HBase. Just don't. It'll get messy quickly."


I'd like to use our HBase cluster for more than just OpenTSDB. If I write to different tables that OpenTSDB doesn't use, AND I use the same instance of HBaseClient (as the API clearly commands), will things still get messy? If so, what is the preferred way to use the cluster for more than just OpenTSDB?



Thanks,
Bryan

John A. Tamplin

Apr 10, 2015, 7:39:53 PM4/10/15
to Bryan H, OpenTSDB

It will work fine, aside from interaction between the two traffic patterns.  If you are pre-splitting all your tables and aren't pushing lots of writes in your other app, it shouldn't be too bad.
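For reference, the pre-splitting mentioned above can be sketched with plain JDK code that computes evenly spaced split points for a timestamp-prefixed key space; the resulting byte arrays would then be passed to table creation. This is only an illustration under assumptions not stated in the thread: a 4-byte big-endian seconds prefix and evenly distributed write load, with the class and method names made up here.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch: compute (regions - 1) split points over a timestamp range so a
// table whose row keys begin with a 4-byte big-endian unix-seconds prefix
// can be pre-created with evenly spaced regions.
public class SplitPoints {
    public static List<byte[]> forRange(long startTs, long endTs, int regions) {
        List<byte[]> splits = new ArrayList<>();
        long step = (endTs - startTs) / regions;
        for (int i = 1; i < regions; i++) {
            // Each split point is the 4-byte key prefix at an even interval.
            splits.add(ByteBuffer.allocate(4).putInt((int) (startTs + i * step)).array());
        }
        return splits;
    }
}
```

Evenly spaced splits only help if writes are spread across the keyed range; with purely time-ordered keys the newest region still takes most of the write load, which is exactly the traffic-pattern interaction to watch.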

John A. Tamplin (phone)

Bryan Hernandez

Apr 10, 2015, 9:59:39 PM4/10/15
to John A. Tamplin, OpenTSDB
Thanks, John.  

I'm afraid I didn't explain my use case in enough detail.  I am using OpenTSDB to visualize events over time.  Because of OpenTSDB's design, I have to do a bit of manipulation to visualize these events.  For instance, saving values as floats drops precision, and I actually have multiple values per datum.  That said, for visualization purposes there's no problem in losing a bit of precision and chopping my events up into multiple values on different time series, as has been suggested in many email threads.

For persistence purposes, though, relying solely on the OpenTSDB schema requires compromises that aren't quite right for my application's needs beyond visualization.  For instance, events occurring at the same time point can't be accommodated in OpenTSDB without some serious finagling.  Because of that, my plan is to use OpenTSDB for realtime visualization of the approximate values of my events (not worrying about slight rounding errors and a few dropped time points).  Queries for exact values, however, would go to different HBase tables with a schema designed for event-based storage in time.  Row keys are going to be something like <unixTimestamp><dataSource><eventId>, where <eventId> increments only when more than one event occurs at the same unixTimestamp.  With pre-splitting of regions, this lets me spread the load across more regions over time, rather than hitting one region very hard for a while and then switching to a virtually idle region, etc.
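The <unixTimestamp><dataSource><eventId> layout above could be packed roughly like this in plain Java. A sketch only: the field widths chosen here (4-byte unsigned seconds, 3-byte data source id, 2-byte event counter) and the class name are assumptions for illustration, not something settled in this thread.

```java
import java.nio.ByteBuffer;

// Sketch of the proposed row key layout: <unixTimestamp><dataSource><eventId>.
// Assumed widths: 4-byte seconds + 3-byte source id + 2-byte counter = 9 bytes.
public class EventRowKey {
    public static byte[] pack(long unixTimestamp, int dataSource, int eventId) {
        ByteBuffer buf = ByteBuffer.allocate(9);
        buf.putInt((int) unixTimestamp);      // big-endian seconds since epoch
        buf.put((byte) (dataSource >>> 16));  // 3-byte data source id
        buf.put((byte) (dataSource >>> 8));
        buf.put((byte) dataSource);
        buf.putShort((short) eventId);        // bumped only on timestamp collisions
        return buf.array();
    }
}
```

Because the fields are fixed-width and big-endian, keys sort lexicographically by timestamp first, which keeps HBase scans over a time range contiguous.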

The reason I explain all this is because a new event will trigger writes to both the TSDB tables as well as my custom HBase tables at pretty much the same time (since they are storing the same data).  Querying for the OpenTSDB UI will only be on the TSDB tables obviously, and queries for exact data will only be on the custom HBase tables.

Since this is an archival system, 99% of the operations are writes; the remaining 1% are reads, for when we fetch data in large batches or the UI queries for graphing.

Now with this more detailed explanation, do you foresee there being problems in supporting the custom HBase tables on the same cluster? 

Thanks again for the advice.

Best,

Bryan

ManOLamancha

Apr 23, 2015, 9:56:47 PM4/23/15
to open...@googlegroups.com, j...@jaet.org

Oh, there's no problem writing to other tables in HBase; that's perfectly fine. You just want to watch your traffic and make sure it's balanced nicely between servers. The warning I left in the docs is just to make sure folks don't mess with the data in TSDB's own tables, since it's very easy to mess up queries.

That said, we can support more object types in the existing time series store, and if your use case is common it would be great to build it into TSDB. Right now we store annotations inline with the data, and some folks have been storing other types as well. Compared to separate tables, this has the advantage of letting you scoop up all the relevant data for a timespan in a single call.