Hi Wes,
I'd say that if all your stats data arrive slower than, say, every N seconds, then you simply should not collect on such a small time interval.. -- there is no magic: if data are missed for a given period of time, or between two points in time, they are really missed, and you cannot assume that a missed value can be represented as an Avg of the latest values before and after.. Nobody knows what it was in reality, as it could have been very low or very high with the same probability ;-)
Personally, when I need stats on different time intervals, I just start 2 parallel collects from the same host: one collecting all the "light" stats on, say, a 10sec interval, and another collecting the "heavy" stats on, say, a 30sec interval. In this case I can at least trust what I'm seeing ;-)
However, you may also adapt your Add-On script so that it provides regular output even if the background stats are lagging:
- for ex. your stat command delivers data at unpredictable time intervals, say anywhere between 5 and 45 sec (or more) - you never know
- but you need to see the corresponding stats printed every N sec..
- so you write a "wrapper" script which itself reads the data coming from the stat command(s) and prints "something" every N sec
- this "something" will then be one of two things:
  - real data (if the stat provided the info you need within the N sec interval)
  - estimated data (if no stats came within N sec)..
- for the estimation you can use, for ex., an Avg of the last 20 values (or more)
- it's also important to have an additional column in the output used as a flag to say which kind of data was printed (say: 1 for "real", and 0 for "estimated") -- after that it'll be very easy to tell how real your stats are by looking at the "data" graph aligned with the "flag" graph over the same time..
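To make the steps above more concrete, here is a minimal sketch of such a wrapper in Python. Everything in it is an assumption for illustration only: the names (`Estimator`, `reader`, `main`), the 10 sec interval, taking the first numeric field of each line as "the value", and printing 0 before any real value has been seen. It is not part of dim_STAT; adapt it to whatever your stat command really prints.

```python
import queue
import subprocess
import sys
import threading
import time
from collections import deque

INTERVAL = 10   # N seconds between printed rows (assumed value)
HISTORY = 20    # how many recent real values feed the estimate


class Estimator:
    """Pass real values through, or estimate from recent real values."""

    def __init__(self, history=HISTORY):
        self.recent = deque(maxlen=history)

    def row(self, value):
        """value is a float if the stat reported in time, else None.

        Returns (value_to_print, flag): flag 1 = real, 0 = estimated.
        """
        if value is not None:
            self.recent.append(value)
            return value, 1
        if self.recent:
            return sum(self.recent) / len(self.recent), 0
        return 0.0, 0   # nothing seen yet: placeholder, flagged as estimated


def reader(stream, q):
    """Push the first numeric field of every line from the stat command."""
    for line in stream:
        fields = line.split()
        try:
            q.put(float(fields[0]))
        except (IndexError, ValueError):
            continue   # skip headers and blank lines


def main(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    q = queue.Queue()
    threading.Thread(target=reader, args=(proc.stdout, q), daemon=True).start()
    est = Estimator()
    next_tick = time.monotonic() + INTERVAL
    while proc.poll() is None or not q.empty():
        time.sleep(max(0.0, next_tick - time.monotonic()))
        next_tick += INTERVAL
        latest = None
        while True:   # keep only the most recent value seen in this interval
            try:
                latest = q.get_nowait()
            except queue.Empty:
                break
        value, flag = est.row(latest)
        print(f"{value:.2f} {flag}", flush=True)   # columns: data, flag


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])   # e.g. wrapper.py <your stat command and its args>
```

The second output column is the "real / estimated" flag, so you can graph it next to the data column and see at a glance which points to trust.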
Now: why can't dim_STAT do exactly the same?.. -- simply because it's always much simpler to implement this at the Add-On script level rather than invent a kind of "universal" solution ;-)
So, from the beginning there has been no time dependency in the code -- each data measurement just gets a serial number (SNO) attributed, which is then correlated with the collect start time to show the corresponding time.. Stats commands may lag for many reasons, or may just have buffered output -- you never know, so it's always healthier to delegate this to the Add-On itself, rather than debug a "general" logic in each broken case ;-)
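As a rough illustration of how an SNO can be turned back into a time, here is a tiny Python sketch. The formula (start time plus SNO times interval) and the function name `sno_to_time` are my simplification for this mail, not the actual dim_STAT code, and whether the first SNO is 0 or 1 is an assumption too (0 here).

```python
from datetime import datetime, timedelta


def sno_to_time(start, interval_sec, sno):
    """Displayed time of a measurement, assuming SNO counts from 0."""
    return start + timedelta(seconds=interval_sec * sno)


start = datetime(2024, 1, 1, 12, 0, 0)   # hypothetical collect start time
print(sno_to_time(start, 10, 6))         # 6 intervals of 10 sec after start
```

This also shows why a lagging stat shifts the graph: the SNO keeps counting rows, not seconds, so rows that arrive late are still displayed at the "ideal" time.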
However, dim_STAT still takes care of lagging stats:
- if there is still no output from a stat during 3 time intervals, the stat collection is restarted
- if the connection was lost, the new SNO is adjusted according to the new time on re-connect
- every hour there is a check for lagging measurements in every stat, and if several measurements are missing, that stat is restarted with a new SNO (say you're collecting vmstat every 10 sec: within 1 hour you should find 360 measurements, and if that's not the case, it means vmstat was lagging somewhere for some reason (for ex. your server was too busy with 100% CPU usage for a long time, etc.) -- so vmstat is restarted and the output re-aligned to the current time)..
- NOTE: every time there is a drop or shift in time in the data, you'll see a vertical red dotted line marking it..
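The arithmetic behind these checks can be sketched in a few lines of Python. The function names and the tolerance of 3 missing measurements are assumptions made up for the example (the real threshold for "several" is not stated here); only the 3 intervals rule and the 360-per-hour figure come from the list above.

```python
def expected_count(interval_sec, window_sec=3600):
    """How many measurements should arrive within the window."""
    return window_sec // interval_sec


def needs_restart(received, interval_sec, tolerance=3, window_sec=3600):
    """True if more than `tolerance` measurements are missing (assumed threshold)."""
    return expected_count(interval_sec, window_sec) - received > tolerance


def is_silent(seconds_since_last_output, interval_sec, factor=3):
    """True if the stat gave no output for `factor` intervals in a row."""
    return seconds_since_last_output >= factor * interval_sec


print(expected_count(10))        # vmstat every 10 sec, checked over 1 hour
print(needs_restart(350, 10))    # 10 measurements missing
print(is_silent(30, 10))         # 30 sec of silence on a 10 sec interval
```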
Let me know if you need any more info, etc..
Rgds,
-Dimitri