On Monday, September 25, 2017 at 12:27:34 AM UTC-5,
minf...@arcor.de
wrote:
>
> Is this just experiments using laptop-generated series as
> proof-of-concept?
>
I'm interested in time-series databases. I looked around and didn't
find any "principles of time series data", "write your own tsdb from
scratch", etc. I found some installation guides basically, but neither
theory nor practice on how such databases actually work, what
performance or resource problems people run into with them, etc.
> In the SCADA world, time series management / compression / storage /
> math
> is quite another beast.
I'd hope that any practical use of time series data is a beast apart
from "dump all points" and "get the average of one value across all
points". Or do you mean that you couldn't use something like influxdb
for SCADA? Its capabilities are documented here:
https://docs.influxdata.com/influxdb/v1.3/
I don't know anything about SCADA though. How about, I'll tell you a
little bit about what I've been doing, and you can say how that
differs from things in the SCADA world.
For about six months out of this last year I was responsible for some
projects related to time-series data. This started out with my taking
over some completely custom webapps whose function was to consume
external monitoring and to, when monitoring said things were bad,
display a lot of red boxes so that people would start making phone
calls. Early on I adopted influxdb for storing time-series data and
grafana for displaying it.
There were two big sources of data: our internal monitoring system
(let's call it, to not offend anyone, Zabbix--because that's what it
is and Zabbix is great), and an external monitoring system (let's call
it, to not offend anyone, Cassandra--because they use Apache Cassandra
and their performance characteristics are 'interesting'). Zabbix and
Cassandra both monitored a really big server farm. Mainly Cassandra
would connect to two differently configured websites running on each
server and report how fast each site loaded from each regional
requester or, if a site failed to load, how long the outage lasted.
Zabbix meanwhile maintained thousands of items per server and would
alert on dozens of issues--a service not running, a service not
remotely accessible, a service appearing to have performance problems,
and so on.
I couldn't just straightforwardly get time-series data from either
system. For Zabbix I had a daemon that listened to the 'event
firehose', kept track of all active alerts, regularly reported ongoing
alerts to influxdb, and then reported again when they closed. In
influxdb's line format these alerts might look something like:
alert,server=big43.bigserver.com,dc=Rome,brand=BigHost,zabbix=zbx1,platform=Big--Shared--Europe,acked=null,description=Nginx\ doesn\'t\ like\ its\ configuration value=120,norm=0.0012 1506398311
So: add two values at that Unix epoch timestamp to the 'alert'
measurement in whatever database, the first named 'value' with a value
of 120 (seconds: how long the alert has lasted so far, or how long it
lasted in total); and the second named 'norm' with a value of 0.0012
(seconds: just value divided by the number of servers in the group,
here 1000). The rest of that line is tags. You could then ask influxdb
a question like
SELECT count(value) FROM alert WHERE dc=Rome AND time > now() - 7d GROUP BY time(1d), brand
and the result could be graphed, with one line per brand: how many
alerts per day did we see from this brand in the last week from the
datacenter in Rome? The graph would have time on the X axis, there
would be one point per day (per brand), and the lines would connect
those points.
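Assembling those line-protocol strings is mostly mechanical. Here's a
minimal sketch in Python of how such a point might be built--the
helper names are invented, and the real daemon did more (batching,
field type handling, etc.):

```python
def escape_tag(value):
    """Escape commas, spaces and equals signs in tag values,
    per the influxdb line-protocol escaping rules."""
    for ch in (',', ' ', '='):
        value = value.replace(ch, '\\' + ch)
    return value

def make_point(measurement, tags, fields, timestamp):
    """Build one line-protocol string:
    measurement,tag=v,... field=v,... timestamp"""
    tag_str = ','.join('%s=%s' % (k, escape_tag(str(v)))
                       for k, v in sorted(tags.items()))
    field_str = ','.join('%s=%s' % (k, v)
                         for k, v in sorted(fields.items()))
    return '%s,%s %s %d' % (measurement, tag_str, field_str, timestamp)

point = make_point(
    'alert',
    {'server': 'big43.bigserver.com', 'dc': 'Rome',
     'description': "Nginx doesn't like its configuration"},
    {'value': 120, 'norm': 0.0012},
    1506398311)
```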
For Cassandra, the external monitoring, I made API calls to ask them
for various reports, then massaged the data into influxdb. The
massaged data might look something like:
outage,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=1 1506397759
outage,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=1 1506398759
outage,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=1 1506398859
availability,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=99.98 1506398859
performance,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe--Cached,continent=NorthAmerica,state=VA value=120 1506398859
performance,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe,continent=NorthAmerica,state=VA value=330 1506398859
Outages are stored as a single value=1 with no duration: "something
happened at this time". Uptime is stored as a float in [0,100].
Performance is stored as milliseconds to page load, with the
continent= and state= tags indicating where the request *initiated*,
not the location of the server.
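The massaging itself was just reshaping. A sketch, with an invented
report structure (the real API's shape and field names differed):

```python
def massage(entry):
    """Turn one entry of a (hypothetical) external-monitor report
    into influxdb line-protocol strings like those above."""
    tags = 'server=%s,dc=%s,brand=%s,platform=%s' % (
        entry['server'], entry['dc'], entry['brand'], entry['platform'])
    lines = []
    # Each outage becomes a value=1 point at the time it happened.
    for ts in entry.get('outage_timestamps', []):
        lines.append('outage,%s value=1 %d' % (tags, ts))
    # Availability is one float point per reporting window.
    lines.append('availability,%s value=%.2f %d'
                 % (tags, entry['availability'], entry['timestamp']))
    return lines

entry = {'server': 'big12.bigserver.com', 'dc': 'Rome',
         'brand': 'BigHost', 'platform': 'Big--Shared--Europe',
         'outage_timestamps': [1506397759, 1506398759, 1506398859],
         'availability': 99.98, 'timestamp': 1506398859}
lines = massage(entry)
```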
One cute thing you could do with this data is grab all of the
performance data for all servers everywhere, and then demonstrate
that, regrettably, the speed of light is still in effect: most of our
stuff is in NA, therefore NA speeds are on average the best, followed
by Europe, then South America:
SELECT mean(value) FROM performance WHERE time > now() - 30d GROUP BY time(1d), continent
The purpose was mainly trending: when things turn bad, someone should
notice and ask questions about it. What is measured, improves. That
only ever needed the last month of data, mostly.
A secondary purpose: "showing that a thing was real". One part of the
company gets customer complaints on the 4th of July. Is this a big
fire, or can people still enjoy their holidays? Well, this graph shows
that yes, there's a huge spike in problems related to the system the
customers are complaining about; sorry. Again this only needs the most
recent data.
An uncommon purpose: "did we do well? / what the hell happened?" It's
noticed after the fact that some metric has slipped, or a team
introduced a new feature and would like to see what impact it made,
and someone gets to prepare, in May, a dashboard that compares some
numbers from February vs. those from March.
The most advanced math was turning uptime figures into periods of
time, and vice-versa. I could for example produce an 'Nginx
Configuration Uptime' for a platform, for a day or week, based on how
long the nginx configuration alerts lasted for servers on that
platform.
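That arithmetic is simple enough; a sketch of both directions for a
single reporting window:

```python
def downtime_seconds(availability_pct, window_seconds):
    """How many seconds of outage does an availability
    percentage imply over a given window?"""
    return window_seconds * (100.0 - availability_pct) / 100.0

def availability_pct(downtime_secs, window_seconds):
    """The inverse: summed alert durations over a window
    back to an uptime percentage."""
    return 100.0 * (1.0 - downtime_secs / window_seconds)

# 99.98% availability over one day is about 17.3 seconds down.
secs = downtime_seconds(99.98, 86400)
```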
Compression and storage issues didn't come up. Hundreds of millions of
points just don't take up enough space to care about. I was so
conservative with Cassandra's influxdb usage that an entire year of
it probably wouldn't take up 1G on the disk. But that's influxdb
handling things--maybe it actually compresses data on the disk, I
don't know.
Performance issues did come up, though, with influxdb regrettably
requiring infinite amounts of RAM to answer some queries. I was much
less conservative with Zabbix's use of influxdb. In the end every
single grafana panel that needed Zabbix data had to deal with, not the
raw measurements, but aggregate measurements built by influxdb
'continuous queries'.
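For reference, a continuous query in influxdb 1.x looks roughly like
this (the database and measurement names here are invented). It
downsamples in the background, so panels can query the small aggregate
series instead of the raw one:

```
CREATE CONTINUOUS QUERY "alert_counts_1h" ON "monitoring"
BEGIN
  SELECT count("value") INTO "alert_1h"
  FROM "alert"
  GROUP BY time(1h), *
END
```

The `GROUP BY time(1h), *` keeps every tag, so the aggregate series
can still be filtered by dc, brand, and so on.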
So, how's the SCADA world?