Although this discussion originally started off-list, I'd encourage
everyone involved to keep the discussions on the mailing list so that
everyone can contribute or just throw in their 2¢.
> For storage, what’s the best way to do this do you think? I initially
> thought of creating a separate HBase table for each rollup interval.
> Then we’d use a similar naming schema as the raw table (minus the
> timestamp) and create a new cell for each timestamp in that interval.
> The API could look at a timespan and automatically determine which
> table to pull from unless the user overrides the choice with some
> flags.
> Or would it be better to store all of the rollups in a single table
> with a slightly different key naming convention ala metricID/
> rollup_type/tagIDs?
From the off-list discussion earlier:
On Tue, Jan 24, 2012 at 6:58 AM, ManOLamancha <clars...@gmail.com> wrote:
> On Tue, Jan 24, 2012 at 1:24 AM, Benoit Sigoure <ts...@stumbleupon.com> wrote:
>> I would advise against using separate tables for rollups. HBase
>> generally performs better with a single table. Rollups could be
>> stored either under a different metric name (e.g. by appending
>> ".hourly" at the end of the metric for which we're doing hourly
>> rollups), or in a different column family, or simply by structuring
>> the key-space differently.
>>
>> I like the idea of introducing new metric names for rollups as it'll
>> keep the data locality good, and yet still distribute it well across
>> the table.
>
> Regarding the rollup storage, you definitely have more experience with Hbase than I do, but is it really faster to load the rollups from the same table as all the raw values? I thought that having a separate table with a smaller btree index on the rows would make for faster access when performing row scans. Or is there some other consideration like region distribution or caching that lets a single giant table perform better than split tables?
Given OpenTSDB's schema, there are virtually no performance difference
between using a single table or multiple tables. This is especially
true if you store a LOT of data. In practice the table for yearly
rollups will be much smaller, so accesses to it might be a bit faster,
but in the grand scheme of thing, this will be insignificant.
Using a single table for all the data makes life a lot easier from an
operational standpoint. It's easier to manage, troubleshoot, account
for, capacity plan for. It will create fewer regions and utilize them
better. Region count becomes a real problem when you start running on
large data sets. The solution is typically to create fewer bigger
regions by increasing the max region size and then merging regions
together. Having a single table for all the data makes this process
simpler, should you have to go through it.
So even if there are small theoretical performance gains, which isn't
even guaranteed, I think the pros of using a single table outweigh the
cons.
Also, using a single table will also keep the code a bit simpler as we
won't have to do "if (blah) then use this table; else if (blah) then
use that table; else if …"
> Also, for the config and other data that would have a pretty small footprint, is it alright to split those into separate tables or is it better to have one table for all of that data? I want to figure out how to design the schema. Thanks again for your feedback!
There is already a separate table for meta data, called "tsdb-uid".
Granted the name implies that it's only for UID mappings, but I would
recommend we put all the meta data in there anyway. A lot of the
meta-data applies per-metric/tag name/tag value, and that table is a
great fit to store it as it already has one row per metric/tag
name/tag value. We'd simply add more columns to these rows.
> For storing the actual values, we could use an 8|8|8 byte format where
> the values are min|max|avg.
> The metric meta data would store a cell with flags about which rollups
> to perform, configured by the user. E.g. the user could, through the
> API or GUI, tell it to rollup all “if.bytes.in” metrics on a daily and
> monthly basis. A thread would have to walk the metadata and update
> each metric of that type.
> Aggregates
>
> We want to use OpenTSDB for capacity planning and properly doing so
> requires that we be able to quickly and easily see a graph with the
> total amount of some data for a service or platform. For example, we
> may want to know the total traffic in and out of a specific platform
> that has over 10,000 devices. Running a scan to fetch all of that
> data, then downsample and generate a grapth would take a looooong
> time, so I think it would be better to have rules, acting on real-time
> data, that perform an aggregation and save the data as another metric
> to the TSDB. I would use the same raw and rollup schemas but add a
> special tag, maybe “tsdagg” that lets us know it’s an aggregate value.
> Metadata
I would also implement this with a modified metric name, e.g. by
adding a ".sum" suffix if you're doing aggregation by sum.
> I would like to add meta data cells to the “metrics” and “tagk” rows
> that can be displayed in the GUI or API. I would have the TSDs loading
> from HBase every 5 minutes or so (configurable). Would it be better to
> have a separate cell for each value or maybe have one extra cell that
> includes a JSON string with the different values?
Meta data could be stored in separate cells yes. You could always
reconstruct the JSON object by doing a get on all the cells.
The TSD's GUI already does everything it does through an API. It's
not cheating. But I agree we need more APIs to make it easier to
build third party GUIs.
For the modules, I feel like it's simpler if it's the same binary that
can do everything, just behaves the way you want depending on how you
invoke it.
I like the idea of persisting config in HBase and loading it at
startup. Yes if we add more JSON stuff, we'll definitely want to use
a well-established JSON library such as GSON or Jackson.
--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
I would encourage you to put everything in the "tsdb-uid" table,
despite its name. As long as we carefully choose how to craft our
keys and column qualifiers, there won't be any collisions.
Rollup and aggregation rules will probably be a per-metric thing, so
it makes sense to store them along whatever other meta-data we'll add
to "tsdb-uid" for metrics.
Users are the only thing that would look weird in the "tsdb-uid"
table, but really at this point it would be better to not add a new
table.
This is true, I regret not calling this table "tsdb-meta". But
creating a new HBase table just for a dozen configuration knobs
doesn't make sense. So re-using "tsdb-uid" is fine.
Table names are configurable anyway. New installs could be deployed
with "tsdb-meta" as the table name, and existing installs that can't
undergo a table rename (as this requires downtime) can keep using
"tsdb-uid" until their next maintenance window.
Ingest
In our network, we have multiple points-of-presence around the globe
with hundreds or thousands of machines in each location generating
data. Since HBsae doesn’t provide real-time writing across data
centers (only replication) and I worry about the stability of
connections from a TSD to a central HBase cluster over a WAN (MySQL
doesn’t like this), I would like to create a scaled down ingest daemon
that can live in each DC. The ingester would be a TSD, configured for
ingest, that would intake the raw metric data from devices in the
local DC. Then it would perform basic checks to make sure the data is
valid (does it have the proper tags, is the data numeric?). It would
spool the information to a local disk via SQLite in case the DC
becomes partitioned and can’t talk to the central servers. If the DC
was portioned off, the ingest daemon could replay the data once a
connection was re-established. Then it would push the data off to the
central brokers. This style is similar to Splunk (a really neat
product) where we install “forwarders” in different data centers that
collect data, bundle it up, and transport it to central processing
nodes.
A critical component for ingest is to also add HTTP input and I would
like to use a JSON format for this. JSON seems to be used all over the
place now and it’s very easy to work with. I would like to add a JSON
input and output format for OpenTSDB and try to get other folks to
standardize on it instead of having tons of slightly different, single
line formats ala the telnet interface or Graphite’s input methods.
We’ve been using a format like this:
{“timestamp”:epoch_time, “metric”:”metric.name”, ”value”:value, “tags”:
{”tagname1”:”tagvalue1”, “tagname2”:”tagvalue2”}}
We’ve also used bulk formats where the metric name and tags are shared
to cut down on network traffic (which isn’t usually an issue but when
you have millions of data points traveling from dozens of DCs, it
starts to become and issue)
Also as a part of this upgrade we would like to include some simple
authentication to the API including LDAP integration and shared-key
access. The shared key would be for communication between TSDs and
source tools. LDAP for users. We need this because the ingest daemons
would be running on networks (and possibly machines) open to the Net,
so we need some way to lock them down a bit. It would be an opt-in
option that users would config so we don’t break existing setups.
Metadata
I would like to add meta data cells to the “metrics” and “tagk” rows
that can be displayed in the GUI or API. I would have the TSDs loading
from HBase every 5 minutes or so (configurable). Would it be better to
have a separate cell for each value or maybe have one extra cell that
includes a JSON string with the different values? A list of cells that
I would add appears in the Schema section. All of this information
would be accessible via the API and different interfaces could access
it. We’ll have a wide range of users (from smart Ops folks to new NOC
techs to management) so we need to metadata to help them make sense of
the flood of data we’ll be tossing at them.
Thresholding/Alerting
I know a few folks in the Google group have been asking for an
alerting mechanism and I saw the Nagios pull script that lets it query
the TSDB for data. I also saw some folks talking about Esper and after
reading some more about that engine, I think it would be a really good
and easy way of providing push alerts. The user would add Esper rules
via the API and a single TSD instance (single because complex Esper
queries need all of the data to function properly) would consume
messages and compare the data against the rules. When a threshold is
reached, it would perform one or more of the following actions:
• Send an email using some Java email library
• Launch a script. Users can write bridges to their own monitoring
infrastructure. We would create one off the bat to interface with
Zabbix.
• Log it to HBase for auditing later (this happens regardless)
Esper is mature, open source and written in Java. If users want, they
can plugin a different engine.
GUI
It sounds like a few folks are working on different GUIs for OpenTSDB
so I was thinking of divorcing the HTTP GUI from the TSD and letting
it access data only via the API. That way everyone could write
whatever interface they want and easily integrate it into their
existing control panels.
Part of the API calls I want to add include support for saving graphs
and creating dashboards for different users. I envision using Graphite
to let users design their dashboards and it would write the settings
back to HBase via the TSD API.
MQ
To maintain horizontal scalability and redundancy, I would like to
modularize the TSDs where users can install the TSD on a box and
configure it to perform whatever roll they want. MQ is an ideal
solution because it’s also horizontally scalable. The OpenTSDB modules
would hang off the MQ bus and each module (that works with data) would
setup a different queue (based on it’s roll) to receive data in real-
time. We’ve used RabbitMQ with a lot of success and I would start with
that for MQ work but we could plugin any MQ engine. This means we
don’t need a ton of inter-process communication but for instances
where we do need it, we can push messages back into the MQ brokers for
delivery.
Another possibility is using Storm, which looks very promising but I’m
not sure how it coordinates between distributed processes. Plus it
would require a good amount of work to convert the TSD into a
distributable JAR.
Reporter
Part of our use requires reports being generated every so often so I
would write a schema in HBase for storing configuration data. Then a
daemon would run and spit the report data out via email and/or scripts
for ingestion in other services.
General
Configuration – I would like to use HBase for centralized
configuration system. The main API TSDs would have to have a config
file pointing them at the HBase cluster. On startup, these would pull
their config data directly from HBase. Then all the other modules
would simply require a config file with a list of API hosts (and
authentication) to connect to to pull their config info.
[...]
[...]
In another discussion there were talks about canceling long running queries. With such a kick-ass data-API, you could be able to run async or streaming queries and send a cancel for the request if it is taking too long.
The current HTTP api would have no clue which request you mean if you send an update (like cancel) for an earlier request.
Another long running query could be a full text search on metric names (which is currently not a good idea if you have a lot of metrics and HBase does prefix searching apparently).
On Tuesday, 24 January 2012 15:52:43 UTC+1, ManOLamancha wrote:Ingest
IMHO, this could (and probably shoud) be a separate project. From my experience with Etsy's StatsD, such servers can be written in a fairly simple manner.
And why spool in SQLite? Wouldn't appending to plain files be simpler? (And, when all goes to hell, they could be bulk-imported to tsdb later on.
As to what to send over HTTP, a client could easily set a 'Content-type: application/json'-header for JSON, and other types for whatever else it accepts.
Authentication in the HTTP/Presentation-layer could easily be done via a proxy (and you need a perimeter-host anyway, given Hadoops lacking intra-cluster auth)
RollupsAggregates
I might have missed something here, but I don't see any big difference between rollups and the aggregates-parts; they both read like persisted/pre-warmed caches to me.
While caching most certainly would make sense for historical data, I see a lot of complexity in making sure the cache actually contain something useful.Just for an example, I don't think min/max/avg rollups are enough; I want percentiles, standard deviation and medians. And for quite a few types of data, a plain sum just makes more sense - so that will either have to be configured somewhere, or use a good deal of disk storing all the un-used aggregates/rollups...
Thresholding/Alerting
Couldn't Esper be another back-end/consumer, just like the HBase and Cassandra writers?My only wish is that this sort of stuff is done as plug-ins of some sort, as I expect quite a few users either will have their own alerting-infrastucture or wish to use something else.
Wait, what? You want to put a dashboard-saving API into OpenTSDB? Why not focus on giving OpenTSDB a kick-ass data-API, and then let each dashboard/interface/consumer/whatever do their own storage.I'm all for borrowing the Graphite web-interface (it is amazingly good on Carbon), but I think it should rather have a thin shim/proxy for serving data, rather than integrating it directly into OpenTSDB.
MQ
Or do you want to put all the actual data into RabbitMQ? If so, why not a back-end plug-in that just outputs to RabbitMQ? And I guess it wouldn't be too hard to write a small proxy that listens for RabbitMQ-messages and forwards them to a TSD.
Modules
Sticking to the UNIX-philosophy, couldn't a lot of this be put in separate projects? Granted, deeply integrating such tools has it's advantages, but it makes the whole project much more complex.
General
Configuration – I would like to use HBase for centralized
configuration system. The main API TSDs would have to have a config
file pointing them at the HBase cluster. On startup, these would pull
their config data directly from HBase. Then all the other modules
would simply require a config file with a list of API hosts (and
authentication) to connect to to pull their config info.Again, I believe ZooKeeper is built for this exact purpose.
All in all, I see a lot of stuff that would be solved by having some sort of pluggable front/back-ends.- Keep "core" OpenTSDB small, lean and, hopefully, bug-free (and we bother tsuna less)- No dependency-hell for simple installations.- Allow for interfacing with proprietary stuff without having to open-source all of it.